Statistical postprocessing is a method to improve the direct model output of numerical weather prediction (NWP) models. In this assignment we use linear regression models based on the observed air temperature at 2 meters (T2Obs) and the observed dewpoint temperature at 2 meters (D2Obs) to improve the forecast capability of an NWP model. We develop linear regression models for the air temperature at 2 meters (T2) and the dewpoint temperature at 2 meters (D2), and compare them with the standard modeling tools available in R such as lm, step and glm. We then use the predicted T2 and D2 to calculate the relative humidity (RH), and compute the root mean square error (rmse) of RH both for the ECMWF NWP model and for the linear models. The results show that the linear regression models reduce the rmse of RH by at most a factor of 2. The data used for the analysis are the weather data for the Kumpula weather station, covering the four seasons and 64 forecast periods, produced by ECMWF, together with the T2Obs and D2Obs observations available at the Finnish Meteorological Institute.
Statistical postprocessing is a method to improve the direct model output of numerical weather prediction models. It enhances the NWP model's forecasting ability by relating the model output to observational or additional model data using statistical methods. It incorporates local observations into the model output to reduce the systematic biases of the model.
Model Output Statistics (MOS) is a type of statistical postprocessing technique. MOS was defined by Glahn and Lowry in 1972 as follows: "Model Output Statistics is an objective weather forecasting technique which consists of determining a statistical relationship between a predictand and variables forecast by a numerical model at some projection time(s). It is, in effect, the determination of the 'weather related' statistics of a numerical model."
The Finnish Meteorological Institute (FMI) has started a project called POSSE that aims to improve the accuracy of weather forecasts by combining the raw forecasts and local observations with statistical models. In POSSE, the methodology for statistical postprocessing of weather forecasts is MOS, and it consists of determining a statistical relationship between the response variable, for example the observed temperature, and predictor variables such as mean sea level pressure, high cloud cover, etc. The vision of POSSE is to enable FMI to produce cost-efficient, reliable, and competitive forecasts in Finland and around the world.
In the POSSE project, work has already been done in improving the forecasts by incorporating the observed temperature at 2 meters with the numerical weather model obtained from ECMWF (European Centre for Medium-Range Weather Forecasts). This assignment problem is a part of my work in the POSSE project where we analyze the interdependence of variables of the forecast model.
Specifically, in this assignment we analyze the interdependence of T2, the air temperature at 2 meters, and D2, the dewpoint temperature at 2 meters.
Dewpoint temperature indicates the amount of moisture in the air: a high dewpoint indicates a higher moisture content at a given temperature. The dewpoint temperature is defined as the temperature at which the air can no longer hold all of the water vapour mixed with it, so some of the water vapour must condense. The dewpoint temperature is always lower than or equal to the air temperature.
Relative humidity (RH) indicates how moist the air is. The air has a high relative humidity when the air temperature and the dewpoint temperature are close to each other; when there is a large difference between them, the relative humidity is low. A relative humidity of 100% indicates that the dewpoint equals the current air temperature and that the air is maximally saturated with water. When the dewpoint remains constant and the temperature increases, the relative humidity decreases.
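The relationship between the 2-meter temperature, the dewpoint and RH can be made concrete with the widely used Magnus approximation for saturation vapour pressure. This is only a sketch: the Magnus constants below are one common choice, not necessarily the formula used operationally at FMI or ECMWF.

```r
# Relative humidity (%) from air temperature and dewpoint, both in Kelvin,
# via the Magnus approximation for saturation vapour pressure.
# NOTE: illustrative only; not necessarily the formula used at FMI/ECMWF.
relative_humidity <- function(t2_kelvin, d2_kelvin) {
  t_c  <- t2_kelvin - 273.15   # air temperature in degrees Celsius
  td_c <- d2_kelvin - 273.15   # dewpoint in degrees Celsius
  es <- 6.112 * exp(17.62 * t_c  / (243.12 + t_c))   # saturation vapour pressure (hPa)
  e  <- 6.112 * exp(17.62 * td_c / (243.12 + td_c))  # actual vapour pressure (hPa)
  100 * e / es
}

relative_humidity(280, 280)  # dewpoint equals air temperature: RH is 100%
```

As the last line shows, when the dewpoint equals the air temperature the formula returns 100%, matching the saturation case described above.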
We have the ECMWF model outputs for T2 and D2, as well as the observed temperature and dewpoint temperature at 2 meters, T2Obs and D2Obs respectively. We develop a linear regression model with T2Obs as the response variable and the ECMWF model forecast as the predictor, and use it to predict T2. A similar linear regression is carried out with D2Obs as the response variable to predict D2. We then calculate the relative humidity (RH), which is a function of T2 and D2. We compare our linear regression model with the modeling tools available in R: simple linear regression (lm), stepwise linear regression (step) and linear regression with cross-validation (cv.glm).
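The three reference fits can be sketched in R as follows. This is an illustration only: the synthetic data frame stands in for the real weather set (the predictor names mirror the real data, the values do not), and the exact model formulas used later may differ.

```r
library(boot)  # provides cv.glm for K-fold cross-validation

# Synthetic stand-in for the weather data; names mirror the real variables.
set.seed(1)
n <- 200
weather <- data.frame(T2 = rnorm(n, 279, 8), SKT = rnorm(n, 279, 8))
weather$T2Obs <- 0.9 * weather$T2 + 0.1 * weather$SKT + rnorm(n)

fit_lm   <- lm(T2Obs ~ ., data = weather)              # simple linear regression
fit_step <- step(fit_lm, trace = 0)                    # stepwise selection by AIC
fit_glm  <- glm(T2Obs ~ ., data = weather)             # gaussian glm, needed by cv.glm
cv_error <- cv.glm(weather, fit_glm, K = 10)$delta[1]  # 10-fold CV prediction error
```

The `delta[1]` component of `cv.glm` is the raw cross-validation estimate of the prediction error, which is the quantity compared across models.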
The results of this work are the following: 1. Our linear regression model is comparable with the lm, step and glm models. 2. Errors in predicted variables are amplified in the quantities that depend on them; for example, errors in the predicted air temperature at 2 meters (T2) and the predicted dewpoint temperature at 2 meters (D2) are amplified in the relative humidity computed from them. 3. The linear regression models reduce the rmse of RH by at most a factor of 2. 4. Statistical postprocessing with the MOS methodology, using linear regression with the observed air temperature at 2 meters (T2Obs) and the observed dewpoint temperature at 2 meters (D2Obs), improves the forecast model.
The work on the interdependence of variables was assigned to me by Jussi Ylhäisi (jussi.ylhaisi@fmi.fi) in January 2017. A linear regression model for T2 using T2Obs and the ECMWF raw model data already existed at FMI. I developed the linear regression model for D2 using D2Obs and the ECMWF raw model data, and calculated RH based on T2 and D2.
Discussions with Jussi Ylhäisi and other POSSE project members helped me to study this problem in depth. I have also been influenced by the R coding style used at FMI and by the Google R style guide.
The work done in this assignment, developing and analyzing the linear regression model (named mymodel), is fully my own. I developed the linear regression models for T2 and D2 and calculated RH for the Kumpula weather station. Jussi Ylhäisi provided the original csv file “station2998_all_data_with_timestamps.csv”. I wrote create_weather_data.R to produce the clean data (without NAs and constant variables), which is saved in the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv”.
Weather data is time series data, i.e., data taken at discrete time intervals. As an example of a time series, we plot the observed 2-meter temperature, T2Obs, in degrees Kelvin for the years 2011 to 2015. As expected, the observed temperature shows a seasonal pattern, varying from high temperatures in summer to low temperatures in winter, as well as a diurnal pattern varying with the hour of the day.
library(ggplot2)
timeseries <- read.table(file = "data/timeseries_example.csv", sep = ",", header = TRUE)
timeseries$timestamp <- as.Date(strptime(timeseries$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(timeseries$timestamp, timeseries$T2Obs, xlab="year", ylab="Observed temperature (Kelvin)")
The weather data we use in this assignment is based on ECMWF’s 10-day forecasts for the Kumpula station (station id 2998). There are five station- and time-related variables (station_id, season, analysis_time, forecast_period and timestamp) and 25 predictor variables which describe, for example, temperature, dewpoint and relative humidity forecasts. We also have two response variables: the actual observation of the 2-meter temperature, denoted T2Obs, and the observed 2-meter dewpoint temperature, denoted D2Obs. The ECMWF 10-day forecast includes forecasts for T2 and D2. The weather data cover the period from 2011-12-01 to 2015-10-01.
T2Obs and D2Obs are available from the [FMI open data portal](https://en.ilmatieteenlaitos.fi/open-data). We use the model data from the Atmospheric Model high resolution 10-day forecast (HRES) from ECMWF.
The variables of the raw forecast model from ECMWF (for the Single level -forecast) are defined here. The definitions of the predictor variables we use are given below.
| No. | Abbreviation | Description |
|---|---|---|
| 1 | station_id | ECMWF id for the stations (Check) |
| 2 | season | Winter, Spring, Summer and Autumn( 1-4) (Check) |
| 3 | analysis_time | 00UTC and 12UTC (Coordinated Universal Time) (Check) |
| 4 | forecast_period | Forecast period. (Check) |
| 5 | timestamp | Timestamps 2011-12-06 12:00:00 to 2015-10-01 12:00:00 |
| 6 | T2_Obs | Observed temperature at 2 meters |
| 7 | MSL | Mean sea level pressure |
| 8 | T2 | 2 meter temperature (ECMWF:2T) |
| 9 | D2 | 2 meter dewpoint temperature (ECMWF: 2D) |
| 10 | U10 | 10 meter U-velocity (ECMWF: 10U) |
| 11 | V10 | 10 meter V-velocity (ECMWF: 10V) |
| 12 | TP | Total precipitation |
| 13 | LCC | Low cloud cover |
| 14 | MCC | Medium cloud cover |
| 15 | HCC | High cloud cover |
| 16 | SKT | Skin temperature (temperature on the surface of the Earth) |
| 17 | FG10_3 | 10 meter wind gust in the last 3 hours (ECMWF: 10FG3) |
| 18 | MX2T3 | Maximum temperature at 2 meter in the last 3 hours |
| 19 | RH_950 | Relative humidity at 950hPa |
| 20 | T_950 | Temperature at 950hPa |
| 21 | RH_925 | Relative humidity at 925hPa |
| 22 | T_925 | Temperature at 925hPa |
| 23 | RH_850 | Relative humidity at 850hPa |
| 24 | T_850 | Temperature at 850hPa |
| 25 | RH_700 | Relative humidity at 700hPa |
| 26 | T_700 | Temperature at 700hPa |
| 27 | RH_500 | Relative humidity at 500hPa |
| 28 | T_500 | Temperature at 500hPa |
| 29 | T2_M1 | 2 meter temperature at previous forecast period |
| 30 | T_950_M1 | Temperature at 950hPa at previous forecast period |
| 31 | T_925_M1 | Temperature at 925hPa at previous forecast period |
| 32 | DECLINATION | Declination |
| 33 | D2_M1 | 2 meter dew point temperature at previous forecast period |
| 34 | D2_Obs | Observed dew point temperature at 2 meters |
The original data file is “station2998_all_data_with_timestamps.csv”. It has 299520 observations of 34 variables. Unfortunately, since the size of this file exceeds GitHub’s recommended maximum file size, I cannot upload it to my GitHub repository. The file has missing values, represented by “NA”. We remove the variables with a large number of NAs (number of NAs > number of observations / 4). Two variables, “FG10_3” (10-meter wind gust in the last 3 hours) and “MX2T3” (maximum temperature at 2 meters in the last 3 hours), are removed on this basis. We also remove constant (zero-variance) variables, but this file has no such variable except station_id, which is kept. Finally, we remove the rows with NAs in the remaining 32 variables. The program create_weather_data.R performs these steps and saves the “cleaned” data in the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv”.
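The cleaning steps described above can be sketched as follows. This is a simplified outline of what create_weather_data.R does; the actual script may differ in its details.

```r
# Simplified sketch of the cleaning performed by create_weather_data.R:
# 1) drop variables with too many NAs (more than a quarter of the rows),
# 2) drop constant (zero-variance) variables, keeping station_id,
# 3) drop the remaining rows that contain NAs.
clean_weather_data <- function(raw) {
  too_many_na <- sapply(raw, function(x) sum(is.na(x))) > nrow(raw) / 4
  raw <- raw[, !too_many_na, drop = FALSE]
  constant <- sapply(raw, function(x) length(unique(x[!is.na(x)])) <= 1)
  constant[names(constant) == "station_id"] <- FALSE  # station_id is constant but kept
  raw <- raw[, !constant, drop = FALSE]
  na.omit(raw)
}
```

For example, a variable with NAs in more than a quarter of its rows is dropped entirely, whereas a variable with only a few NAs is kept and just those rows are removed.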
The “cleaned” weather data, T2_D2_Obs_model_data_with_timestamps, is loaded from the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv”. The weather data is multidimensional: it has 290982 observations of 32 variables. The first five columns are station_id, season, analysis_time, forecast_period and timestamp. We choose the Kumpula station, which has station id 2998. We have data for all four seasons, Winter, Spring, Summer and Autumn, represented by the values 1, 2, 3 and 4 respectively.
Since we are using 10-day forecasts, the forecast_period variable takes values from 2 to 65, covering lead times from +0 to +240 hours. Between +0 and +144 the step size is 3 hours, and between +144 and +240 it is 6 hours. A forecast_period value of 2 gives the 3-hour forecast, a value of 9 gives the 24-hour (one-day) forecast, and the value 65 represents the 10-day forecast. The forecasts are created twice a day, at 00:00 UTC (Coordinated Universal Time) and 12:00 UTC; the variable analysis_time has the two values 1 and 2, corresponding to 00:00 UTC and 12:00 UTC.
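The mapping from forecast_period values to lead times can be sketched as a small helper. The indexing convention assumed here (value 1 corresponding to the +0 h analysis time, so values 1 to 49 cover the 3-hourly steps up to +144 h) is an assumption made for illustration.

```r
# Lead time in hours for a forecast_period value, assuming value 1 = +0 h,
# 3-hour steps up to +144 h (values 1..49) and 6-hour steps beyond (values 50..65).
# The indexing convention is an assumption made for illustration.
lead_time_hours <- function(period) {
  ifelse(period <= 49, (period - 1) * 3, 144 + (period - 49) * 6)
}

lead_time_hours(2)   # the 3-hour forecast
lead_time_hours(65)  # 240 hours, i.e. the 10-day forecast
```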
The response variable T2Obs is in column 6 and the other response variable, D2Obs, is in the last column of the weather data. Columns 8 and 9 give the ECMWF model output for T2 and D2. From the ECMWF model we also have the temperature and relative humidity at various pressure levels: 950 hPa (hectopascal), 925 hPa, 850 hPa, 700 hPa and 500 hPa. In addition, we have the 2-meter temperature (T2_M1), the temperature at 950 hPa (T_950_M1), the temperature at 925 hPa (T_925_M1) and the 2-meter dewpoint temperature (D2_M1) from the previous forecast period.
The ECMWF model also provides the temperature at the surface of the Earth, called skin temperature (SKT), and the wind velocity components at 10 meters (U10 and V10). Declination data is also available; declination is defined as the angular distance north or south from the celestial equator, measured along a great circle passing through the celestial pole.
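Since U10 and V10 are the zonal (east-west) and meridional (north-south) wind components, derived quantities such as wind speed and the meteorological wind direction follow directly. A small sketch, not part of the assignment’s models:

```r
# Wind speed (m/s) from the U (zonal) and V (meridional) components.
wind_speed <- function(u, v) sqrt(u^2 + v^2)

# Meteorological wind direction: degrees clockwise from north, giving the
# direction the wind blows FROM (0 = northerly, 270 = westerly).
wind_dir <- function(u, v) (270 - atan2(v, u) * 180 / pi) %% 360

wind_speed(3, 4)  # 5 m/s
```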
We load the weather data, T2_D2_Obs_model_data_with_timestamps; below we can see the structure of the data.
# loading the data from the file
T2_D2_Obs_model_data_with_timestamps <- read.table(file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv", sep = ",", header = TRUE)
dim(T2_D2_Obs_model_data_with_timestamps)
## [1] 290982 32
colnames(T2_D2_Obs_model_data_with_timestamps)
## [1] "station_id" "season" "analysis_time"
## [4] "forecast_period" "timestamp" "T2Obs"
## [7] "MSL" "T2" "D2"
## [10] "U10" "V10" "TP"
## [13] "LCC" "MCC" "HCC"
## [16] "SKT" "RH_950" "T_950"
## [19] "RH_925" "T_925" "RH_850"
## [22] "T_850" "RH_700" "T_700"
## [25] "RH_500" "T_500" "T2_M1"
## [28] "T_950_M1" "T_925_M1" "DECLINATION"
## [31] "D2_M1" "D2Obs"
str(T2_D2_Obs_model_data_with_timestamps)
## 'data.frame': 290982 obs. of 32 variables:
## $ station_id : int 2998 2998 2998 2998 2998 2998 2998 2998 2998 2998 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ analysis_time : int 1 1 1 1 1 1 1 1 1 1 ...
## $ forecast_period: int 5 5 5 5 5 5 5 5 5 5 ...
## $ timestamp : Factor w/ 11184 levels "2011-12-01 03:00:00",..: 4 12 20 28 36 44 52 60 68 76 ...
## $ T2Obs : num 279 280 276 278 276 ...
## $ MSL : num 101258 99586 99785 97798 97875 ...
## $ T2 : num 279 279 276 277 276 ...
## $ D2 : num 277 277 272 276 274 ...
## $ U10 : num 4.8 5 4.2 8.4 3.9 5.4 1.4 1.6 -5.2 5.5 ...
## $ V10 : num 5.4 4.3 3 7.3 5.4 0.5 5.9 7.2 11.5 6.9 ...
## $ TP : num 0 0.03 0 0.27 0.23 0 0.03 0.17 0.2 0.43 ...
## $ LCC : num 0.42 0.29 0.01 0.58 0.24 0 0.56 0.58 0.53 0.85 ...
## $ MCC : num 0 1 0 0.27 0.18 0.01 0.04 0.75 1 0.96 ...
## $ HCC : num 1 1 0 0.91 0.78 0.94 0.74 0.91 1 1 ...
## $ SKT : num 279 278 276 278 276 ...
## $ RH_950 : num 94.5 87.2 67.8 94.9 97 81.6 77.2 95.2 88.7 85.9 ...
## $ T_950 : num 275 276 274 276 274 ...
## $ RH_925 : num 52.2 81.6 74.7 97.4 93.4 79.7 83.2 84.3 92.1 94.9 ...
## $ T_925 : num 275 275 272 274 272 ...
## $ RH_850 : num 7.9 63.9 23 84.3 89.2 ...
## $ T_850 : num 273 270 268 270 268 ...
## $ RH_700 : num 16.5 70.1 17.9 50.6 62.5 ...
## $ T_700 : num 264 260 257 260 258 ...
## $ RH_500 : num 53.9 100.3 22.3 69.8 59.3 ...
## $ T_500 : num 246 248 246 240 237 ...
## $ T2_M1 : num 278 279 275 278 276 ...
## $ T_950_M1 : num 276 276 273 276 274 ...
## $ T_925_M1 : num 276 275 272 274 273 ...
## $ DECLINATION : num -13.6 -13.9 -14.3 -14.6 -14.9 ...
## $ D2_M1 : num 276 277 272 277 274 ...
## $ D2Obs : num 278 278 271 275 274 ...
The minimum, maximum, and quantile values of the weather data for columns 6:32 are shown below. T2Obs has a mean value of 280.0 K and varies from a minimum of 247.7 K to a maximum of 303.6 K. D2Obs has a mean value of 276.0 K and varies from a minimum of 244.8 K to a maximum of 293.1 K.
# summary of the weather dataset
summary(T2_D2_Obs_model_data_with_timestamps[,-c(1:5)])
## T2Obs MSL T2 D2
## Min. :247.7 Min. : 94622 Min. :246.7 Min. :244.2
## 1st Qu.:274.1 1st Qu.:100589 1st Qu.:273.3 1st Qu.:270.6
## Median :279.6 Median :101300 Median :278.9 Median :275.7
## Mean :280.0 Mean :101280 Mean :279.3 Mean :275.9
## 3rd Qu.:287.1 3rd Qu.:101984 3rd Qu.:286.6 3rd Qu.:282.8
## Max. :303.6 Max. :105590 Max. :300.7 Max. :293.9
## U10 V10 TP LCC
## Min. :-15.2000 Min. :-14.100 Min. :-0.03000 Min. :0.0000
## 1st Qu.: -1.8000 1st Qu.: -1.400 1st Qu.: 0.00000 1st Qu.:0.0100
## Median : 1.3000 Median : 1.100 Median : 0.00000 Median :0.2700
## Mean : 0.8083 Mean : 1.032 Mean : 0.08565 Mean :0.4265
## 3rd Qu.: 3.5000 3rd Qu.: 3.300 3rd Qu.: 0.05000 3rd Qu.:0.9300
## Max. : 13.9000 Max. : 16.500 Max. : 5.75000 Max. :1.0000
## MCC HCC SKT RH_950
## Min. :0.0000 Min. :0.0000 Min. :248.0 Min. : 1.10
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:272.9 1st Qu.: 61.30
## Median :0.0900 Median :0.1000 Median :278.8 Median : 79.50
## Mean :0.3202 Mean :0.3809 Mean :279.3 Mean : 76.16
## 3rd Qu.:0.6800 3rd Qu.:0.9200 3rd Qu.:286.6 3rd Qu.: 94.60
## Max. :1.0000 Max. :1.0000 Max. :302.7 Max. :109.10
## T_950 RH_925 T_925 RH_850
## Min. :244.2 Min. : 0.90 Min. :243.8 Min. : 0.40
## 1st Qu.:271.3 1st Qu.: 60.70 1st Qu.:270.4 1st Qu.: 50.50
## Median :277.0 Median : 78.10 Median :276.0 Median : 72.90
## Mean :277.3 Mean : 74.92 Mean :276.2 Mean : 67.24
## 3rd Qu.:284.3 3rd Qu.: 93.80 3rd Qu.:282.9 3rd Qu.: 88.10
## Max. :299.0 Max. :114.50 Max. :297.2 Max. :120.20
## T_850 RH_700 T_700 RH_500
## Min. :243.1 Min. : 0.50 Min. :238.3 Min. : 0.90
## 1st Qu.:267.6 1st Qu.: 24.70 1st Qu.:259.6 1st Qu.: 26.20
## Median :273.1 Median : 50.20 Median :265.7 Median : 50.40
## Mean :272.8 Mean : 51.65 Mean :265.0 Mean : 53.92
## 3rd Qu.:278.7 3rd Qu.: 78.00 3rd Qu.:270.7 3rd Qu.: 82.90
## Max. :292.5 Max. :117.80 Max. :281.8 Max. :126.10
## T_500 T2_M1 T_950_M1 T_925_M1
## Min. :225.6 Min. :246.7 Min. :244.2 Min. :243.8
## 1st Qu.:243.7 1st Qu.:273.3 1st Qu.:271.3 1st Qu.:270.4
## Median :249.8 Median :278.9 Median :277.0 Median :276.0
## Mean :249.0 Mean :279.3 Mean :277.3 Mean :276.2
## 3rd Qu.:254.8 3rd Qu.:286.6 3rd Qu.:284.3 3rd Qu.:282.9
## Max. :265.3 Max. :302.2 Max. :299.0 Max. :297.2
## DECLINATION D2_M1 D2Obs
## Min. :-23.4401 Min. :244.2 Min. :244.8
## 1st Qu.:-16.6123 1st Qu.:270.7 1st Qu.:271.1
## Median : 0.7669 Median :275.7 Median :275.8
## Mean : 0.2004 Mean :275.9 Mean :276.0
## 3rd Qu.: 16.8441 3rd Qu.:282.8 3rd Qu.:282.8
## Max. : 23.4401 Max. :293.9 Max. :293.1
Here we plot histograms of the observed temperature at 2 m, the 2-m temperature from the ECMWF model, and the 2-m temperature at the previous forecast period, together with the corresponding observed, modeled, and previous-period dewpoint temperatures at 2 m. Unfortunately I was not able to reduce the font size of the titles, so the full titles are not shown in the figures.
library(ggplot2)
library(gridExtra)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$T2Obs, geom= "histogram", binwidth = 2, xlab = "T2Obs",
fill = I("red3"), col = I("gray"), main="Observed temp at 2 m")
p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$T2, geom= "histogram", binwidth = 2, xlab = "T2",
fill = I("tomato"), col = I("gray"), main = "Temp at 2 m - ECMWF model")
p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$T2_M1, geom= "histogram", binwidth = 2, xlab = "T2_M1",
fill = I("hotpink"), col = I("gray"), main = "Temp at 2 meters - previous forecast period")
p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$D2Obs, geom= "histogram", binwidth = 2, xlab = "D2Obs",
fill = I("blue"), col = I("gray"), main = "Observed Dewpoint temp at 2 m")
p5 <- qplot(T2_D2_Obs_model_data_with_timestamps$D2, geom= "histogram", binwidth = 2, xlab = "D2",
fill = I("dodgerblue"), col = I("gray"), main = "Dewpoint temp at 2 m -ECMWF model")
p6 <- qplot(T2_D2_Obs_model_data_with_timestamps$D2_M1, geom= "histogram", binwidth = 2, xlab = "D2_M1",
fill = I("deepskyblue"), col = I("gray"), main = "Dewpoint temp at 2 m -previous forecast period")
grid.arrange(p1, p2, p3, p4,p5,p6, ncol = 3, nrow = 2)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$U10, geom= "histogram", binwidth = 2, xlab = "U10",
fill = I("yellow"), col = I("gray"), main="10 meter U-velocity")
p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$V10, geom= "histogram", binwidth = 2, xlab = "V10",
fill = I("yellowgreen"), col = I("gray"), main = "10 meter V-velocity")
grid.arrange(p1, p2, ncol = 2, nrow = 1)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$TP, geom= "histogram", binwidth = 0.1, xlab = "TP",
fill = I("bisque"), col = I("white"), main="Total precipitation")
p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$LCC, geom= "histogram", binwidth = 0.05, xlab = "LCC",
fill = I("gray60"), col = I("white"), main = "Low cloud cover")
p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$MCC, geom= "histogram", binwidth = 0.05, xlab = "MCC",
fill = I("gray45"), col = I("white"), main = "Medium cloud cover")
p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$HCC, geom= "histogram", binwidth = 0.05, xlab = "HCC",
fill = I("black"), col = I("white"), main = "High cloud cover")
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_950, geom= "histogram", binwidth = 2, xlab = "RH_950",
fill = I("forestgreen"), col = I("gray"), main="Relative humidity at 950 hPa")
p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_925, geom= "histogram", binwidth = 2, xlab = "RH_925",
fill = I("chartreuse"), col = I("gray"), main="Relative humidity at 925 hPa")
p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_950, geom= "histogram", binwidth = 2, xlab = "T_950",
fill = I("blueviolet"), col = I("gray"), main = "Temperature at 950 hPa")
p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_925, geom= "histogram", binwidth = 2, xlab = "T_925",
fill = I("orchid2"), col = I("gray"), main = "Temperature at 925 hPa")
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_850, geom= "histogram", binwidth = 2, xlab = "RH_850",
fill = I("brown4"), col = I("gray"), main="Relative humidity at 850 hPa")
p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_700, geom= "histogram", binwidth = 2, xlab = "RH_700",
fill = I("brown1"), col = I("gray"), main="Relative humidity at 700 hPa")
p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$RH_500, geom= "histogram", binwidth = 2, xlab = "RH_500",
fill = I("darksalmon"), col = I("gray"), main="Relative humidity at 500 hPa")
p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_850, geom= "histogram", binwidth = 2, xlab = "T_850",
fill = I("hotpink4"), col = I("gray"), main = "Temperature at 850 hPa")
p5 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_700, geom= "histogram", binwidth = 2, xlab = "T_700",
fill = I("hotpink1"), col = I("gray"), main = "Temperature at 700 hPa")
p6 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_500, geom= "histogram", binwidth = 2, xlab = "T_500",
fill = I("thistle1"), col = I("gray"), main = "Temperature at 500 hPa")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)
p1 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_950_M1, geom= "histogram", binwidth = 2, xlab = "T_950_M1",
fill = I("purple"), col = I("gray"), main="Temperature at 950 hPa at previous forecast period")
p2 <- qplot(T2_D2_Obs_model_data_with_timestamps$T_925_M1, geom= "histogram", binwidth = 2, xlab = "T_925_M1",
fill = I("magenta"), col = I("gray"), main="Temperature at 925 hPa at previous forecast period")
p3 <- qplot(T2_D2_Obs_model_data_with_timestamps$SKT, geom= "histogram", binwidth = 2, xlab = "SKT",
fill = I("limegreen"), col = I("gray"), main="Skin temperature")
p4 <- qplot(T2_D2_Obs_model_data_with_timestamps$DECLINATION, geom= "histogram", binwidth = 2, xlab = "DECLINATION",
fill = I("turquoise"), col = I("gray"), main = "Declination")
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
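The truncated plot titles mentioned earlier can be addressed by shrinking the title font with a ggplot2 theme setting. A minimal sketch; the 8 pt size and the toy data are just example values.

```r
library(ggplot2)

# Reduce the plot title font size so that long titles fit; the same
# theme() layer can be added to any of the qplot() histograms above.
p <- qplot(rnorm(1000), geom = "histogram", binwidth = 0.5, xlab = "T2_M1",
           main = "Temp at 2 meters - previous forecast period") +
  theme(plot.title = element_text(size = 8))
```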
Since our aim is to develop linear regression models for T2 and D2, it is useful to study the correlations between the variables of the weather data. Below we can see the correlation matrix of the weather data. We describe the correlations in detail in the next section, on the linear regression models.
cor(as.matrix(T2_D2_Obs_model_data_with_timestamps[,6:ncol(T2_D2_Obs_model_data_with_timestamps)]))
## T2Obs MSL T2 D2
## T2Obs 1.000000000 -0.05572234 0.95677047 0.905251875
## MSL -0.055722339 1.00000000 -0.08791121 -0.156226839
## T2 0.956770473 -0.08791121 1.00000000 0.950569781
## D2 0.905251875 -0.15622684 0.95056978 1.000000000
## U10 0.067249551 -0.25160619 0.08568880 0.092750632
## V10 0.083376989 -0.14451363 0.11372004 0.183780138
## TP 0.034104010 -0.28421222 0.05729668 0.143197694
## LCC -0.383691554 -0.15559078 -0.34719821 -0.219142567
## MCC -0.088984980 -0.31262176 -0.05359143 -0.009174696
## HCC 0.006123515 -0.18998591 0.02740180 0.083962326
## SKT 0.945729355 -0.07374940 0.98761518 0.926508834
## RH_950 -0.225417792 -0.33476345 -0.17360235 0.006965672
## T_950 0.933069935 -0.08665758 0.95428786 0.937783868
## RH_925 -0.102333330 -0.38616568 -0.04864560 0.096968298
## T_925 0.923163399 -0.07201997 0.94304485 0.936505625
## RH_850 0.126840672 -0.41174015 0.16883134 0.241192270
## T_850 0.882236637 -0.01406959 0.90345704 0.919827444
## RH_700 -0.051455549 -0.35600257 -0.02325322 0.053286552
## T_700 0.839162854 0.09190267 0.86100183 0.870233283
## RH_500 -0.161975033 -0.25026527 -0.14241710 -0.089229234
## T_500 0.806916932 0.13931168 0.82741602 0.822158694
## T2_M1 0.939975772 -0.09340257 0.97775655 0.942577263
## T_950_M1 0.925904241 -0.09404718 0.94656890 0.937203206
## T_925_M1 0.917905055 -0.08163763 0.93769669 0.936316586
## DECLINATION 0.870033713 0.02963431 0.89158550 0.831878801
## D2_M1 0.902328990 -0.16428168 0.94392029 0.986912795
## D2Obs 0.906747887 -0.15062462 0.89559842 0.932057501
## U10 V10 TP LCC MCC
## T2Obs 0.0672495510 0.08337699 0.03410401 -0.3836916 -0.088984980
## MSL -0.2516061855 -0.14451363 -0.28421222 -0.1555908 -0.312621756
## T2 0.0856887975 0.11372004 0.05729668 -0.3471982 -0.053591427
## D2 0.0927506318 0.18378014 0.14319769 -0.2191426 -0.009174696
## U10 1.0000000000 0.17638564 -0.07357771 -0.1202628 -0.121005183
## V10 0.1763856378 1.00000000 0.15411112 0.1900064 0.183867058
## TP -0.0735777117 0.15411112 1.00000000 0.2802422 0.371971817
## LCC -0.1202627949 0.19000640 0.28024220 1.0000000 0.293703773
## MCC -0.1210051826 0.18386706 0.37197182 0.2937038 1.000000000
## HCC -0.1048070197 0.23374528 0.20682647 0.1283931 0.421509022
## SKT 0.0552707731 0.10184030 0.05680998 -0.3175593 -0.047112191
## RH_950 0.0007486148 0.25620542 0.30807049 0.7102336 0.283402843
## T_950 0.0924834782 0.07698242 0.04149562 -0.4237268 -0.094934276
## RH_925 -0.0271169977 0.21620653 0.31443274 0.6508228 0.316837078
## T_925 0.0898313705 0.09243508 0.04733645 -0.4089855 -0.098312864
## RH_850 -0.0630461564 0.10901970 0.32251009 0.3771230 0.438559607
## T_850 0.0555016718 0.15054153 0.06406793 -0.3152063 -0.114083405
## RH_700 -0.0702727463 0.10852325 0.36381725 0.2768023 0.589872232
## T_700 -0.0231638329 0.16060536 0.05516177 -0.2563162 -0.115870251
## RH_500 -0.0483983706 0.14527154 0.21917937 0.1825503 0.581506818
## T_500 -0.0401444308 0.13379521 0.04069243 -0.2832337 -0.088998560
## T2_M1 0.0892890825 0.08541580 0.05910432 -0.3628751 -0.065680899
## T_950_M1 0.0940793151 0.05586961 0.04401064 -0.4188925 -0.100053204
## T_925_M1 0.0939991617 0.07144738 0.04845132 -0.4035580 -0.104133257
## DECLINATION -0.0065621144 -0.08136880 0.02631006 -0.4433739 -0.101191179
## D2_M1 0.1011994993 0.14830507 0.12800187 -0.2345538 -0.031400001
## D2Obs 0.0977762603 0.15584330 0.10274938 -0.2355420 -0.030938198
## HCC SKT RH_950 T_950 RH_925
## T2Obs 0.006123515 0.94572935 -0.2254177916 0.93306994 -0.10233333
## MSL -0.189985910 -0.07374940 -0.3347634473 -0.08665758 -0.38616568
## T2 0.027401800 0.98761518 -0.1736023523 0.95428786 -0.04864560
## D2 0.083962326 0.92650883 0.0069656723 0.93778387 0.09696830
## U10 -0.104807020 0.05527077 0.0007486148 0.09248348 -0.02711700
## V10 0.233745277 0.10184030 0.2562054249 0.07698242 0.21620653
## TP 0.206826472 0.05680998 0.3080704853 0.04149562 0.31443274
## LCC 0.128393126 -0.31755930 0.7102335630 -0.42372680 0.65082283
## MCC 0.421509022 -0.04711219 0.2834028426 -0.09493428 0.31683708
## HCC 1.000000000 0.02115931 0.1743092262 0.02501490 0.16744109
## SKT 0.021159308 1.00000000 -0.1557906797 0.92156630 -0.03598029
## RH_950 0.174309226 -0.15579068 1.0000000000 -0.28194771 0.89481233
## T_950 0.025014901 0.92156630 -0.2819477092 1.00000000 -0.14805078
## RH_925 0.167441089 -0.03598029 0.8948123298 -0.14805078 1.00000000
## T_925 0.035176482 0.91046667 -0.2671040922 0.99317960 -0.16967106
## RH_850 0.145120710 0.17146633 0.4416321765 0.13018712 0.60041071
## T_850 0.067579127 0.87433721 -0.1586977252 0.93948008 -0.09611277
## RH_700 0.245108324 -0.01781787 0.3563350800 -0.04073663 0.39761722
## T_700 0.082134768 0.83759521 -0.1457914095 0.87963528 -0.07531005
## RH_500 0.489669741 -0.14107862 0.2185775948 -0.15236469 0.21102203
## T_500 0.107825895 0.80605285 -0.1856005460 0.84291866 -0.11415980
## T2_M1 0.020391763 0.95404922 -0.1825229238 0.95292850 -0.05854807
## T_950_M1 0.021741645 0.91240215 -0.2661152157 0.99197565 -0.14178616
## T_925_M1 0.030050261 0.90396482 -0.2522525073 0.98672831 -0.15453719
## DECLINATION -0.041371530 0.88899220 -0.2920096878 0.87992845 -0.16398576
## D2_M1 0.066000071 0.91985131 -0.0100536107 0.93337723 0.08077807
## D2Obs 0.063453247 0.87373781 -0.0192977938 0.88936475 0.07055859
## T_925 RH_850 T_850 RH_700 T_700
## T2Obs 0.92316340 0.12684067 0.88223664 -0.05145555 0.83916285
## MSL -0.07201997 -0.41174015 -0.01406959 -0.35600257 0.09190267
## T2 0.94304485 0.16883134 0.90345704 -0.02325322 0.86100183
## D2 0.93650563 0.24119227 0.91982744 0.05328655 0.87023328
## U10 0.08983137 -0.06304616 0.05550167 -0.07027275 -0.02316383
## V10 0.09243508 0.10901970 0.15054153 0.10852325 0.16060536
## TP 0.04733645 0.32251009 0.06406793 0.36381725 0.05516177
## LCC -0.40898549 0.37712300 -0.31520630 0.27680226 -0.25631621
## MCC -0.09831286 0.43855961 -0.11408341 0.58987223 -0.11587025
## HCC 0.03517648 0.14512071 0.06757913 0.24510832 0.08213477
## SKT 0.91046667 0.17146633 0.87433721 -0.01781787 0.83759521
## RH_950 -0.26710409 0.44163218 -0.15869773 0.35633508 -0.14579141
## T_950 0.99317960 0.13018712 0.93948008 -0.04073663 0.87963528
## RH_925 -0.16967106 0.60041071 -0.09611277 0.39761722 -0.07531005
## T_925 1.00000000 0.11250659 0.95761159 -0.04121105 0.89607555
## RH_850 0.11250659 1.00000000 0.02791087 0.53046817 0.04289074
## T_850 0.95761159 0.02791087 1.00000000 -0.04726922 0.94916513
## RH_700 -0.04121105 0.53046817 -0.04726922 1.00000000 -0.11481186
## T_700 0.89607555 0.04289074 0.94916513 -0.11481186 1.00000000
## RH_500 -0.15014955 0.20837072 -0.14577627 0.43980885 -0.15623849
## T_500 0.85666579 0.02188350 0.89948416 -0.10318307 0.95570459
## T2_M1 0.94181997 0.16072684 0.90111712 -0.02887439 0.85620271
## T_950_M1 0.98675557 0.12494343 0.93595178 -0.04081655 0.87451829
## T_925_M1 0.99188630 0.10740663 0.95345013 -0.04284956 0.89073414
## DECLINATION 0.86593866 0.11576323 0.80617637 -0.05128072 0.77495001
## D2_M1 0.93119075 0.22321276 0.91292205 0.03326261 0.86085167
## D2Obs 0.88767322 0.20937636 0.87188165 0.02998447 0.82491440
## RH_500 T_500 T2_M1 T_950_M1 T_925_M1
## T2Obs -0.16197503 0.80691693 0.93997577 0.92590424 0.91790505
## MSL -0.25026527 0.13931168 -0.09340257 -0.09404718 -0.08163763
## T2 -0.14241710 0.82741602 0.97775655 0.94656890 0.93769669
## D2 -0.08922923 0.82215869 0.94257726 0.93720321 0.93631659
## U10 -0.04839837 -0.04014443 0.08928908 0.09407932 0.09399916
## V10 0.14527154 0.13379521 0.08541580 0.05586961 0.07144738
## TP 0.21917937 0.04069243 0.05910432 0.04401064 0.04845132
## LCC 0.18255027 -0.28323374 -0.36287509 -0.41889254 -0.40355801
## MCC 0.58150682 -0.08899856 -0.06568090 -0.10005320 -0.10413326
## HCC 0.48966974 0.10782590 0.02039176 0.02174164 0.03005026
## SKT -0.14107862 0.80605285 0.95404922 0.91240215 0.90396482
## RH_950 0.21857759 -0.18560055 -0.18252292 -0.26611522 -0.25225251
## T_950 -0.15236469 0.84291866 0.95292850 0.99197565 0.98672831
## RH_925 0.21102203 -0.11415980 -0.05854807 -0.14178616 -0.15453719
## T_925 -0.15014955 0.85666579 0.94181997 0.98675557 0.99188630
## RH_850 0.20837072 0.02188350 0.16072684 0.12494343 0.10740663
## T_850 -0.14577627 0.89948416 0.90111712 0.93595178 0.95345013
## RH_700 0.43980885 -0.10318307 -0.02887439 -0.04081655 -0.04284956
## T_700 -0.15623849 0.95570459 0.85620271 0.87451829 0.89073414
## RH_500 1.00000000 -0.16433429 -0.14865150 -0.15337225 -0.15279333
## T_500 -0.16433429 1.00000000 0.82191579 0.83701722 0.85025947
## T2_M1 -0.14865150 0.82191579 1.00000000 0.95418324 0.94296268
## T_950_M1 -0.15337225 0.83701722 0.95418324 1.00000000 0.99321331
## T_925_M1 -0.15279333 0.85025947 0.94296268 0.99321331 1.00000000
## DECLINATION -0.17024680 0.76708514 0.89095089 0.87972000 0.86576181
## D2_M1 -0.10612808 0.81133041 0.95047514 0.93765387 0.93637685
## D2Obs -0.09725270 0.78030745 0.89262683 0.88997069 0.88868619
## DECLINATION D2_M1 D2Obs
## T2Obs 0.870033713 0.90232899 0.90674789
## MSL 0.029634306 -0.16428168 -0.15062462
## T2 0.891585496 0.94392029 0.89559842
## D2 0.831878801 0.98691280 0.93205750
## U10 -0.006562114 0.10119950 0.09777626
## V10 -0.081368797 0.14830507 0.15584330
## TP 0.026310057 0.12800187 0.10274938
## LCC -0.443373868 -0.23455381 -0.23554203
## MCC -0.101191179 -0.03140000 -0.03093820
## HCC -0.041371530 0.06600007 0.06345325
## SKT 0.888992201 0.91985131 0.87373781
## RH_950 -0.292009688 -0.01005361 -0.01929779
## T_950 0.879928453 0.93337723 0.88936475
## RH_925 -0.163985756 0.08077807 0.07055859
## T_925 0.865938655 0.93119075 0.88767322
## RH_850 0.115763230 0.22321276 0.20937636
## T_850 0.806176371 0.91292205 0.87188165
## RH_700 -0.051280717 0.03326261 0.02998447
## T_700 0.774950012 0.86085167 0.82491440
## RH_500 -0.170246803 -0.10612808 -0.09725270
## T_500 0.767085144 0.81133041 0.78030745
## T2_M1 0.890950895 0.95047514 0.89262683
## T_950_M1 0.879720001 0.93765387 0.88997069
## T_925_M1 0.865761809 0.93637685 0.88868619
## DECLINATION 1.000000000 0.83118261 0.80153457
## D2_M1 0.831182610 1.00000000 0.92886882
## D2Obs 0.801534566 0.92886882 1.00000000
library(corrplot); library(dplyr)
cor_matrix<-cor(T2_D2_Obs_model_data_with_timestamps[,6:ncol(T2_D2_Obs_model_data_with_timestamps)]) %>% round(digits=2)
# visualize the correlation matrix
corrplot(cor_matrix, method="number", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
When we tried to plot a scatter-plot matrix of the winter data with all the variables, the result was very difficult to read because the data contain 26 variables. We therefore restrict the plot to a smaller set of variables, namely “T2Obs”, “MSL”, “T2”,“D2”, “U10”,“V10”,“TP”, “LCC”,“MCC”, “HCC”,“SKT”,“RH_950”,“T_950”,“T2_M1”,“T_950_M1”,“DECLINATION”, “D2_M1”,“D2Obs”, and draw the scatter plots for those.
library (ggplot2)
library (GGally)
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
df_winter_00_08_small <- df_winter_00_08[, c(6, 7,8, 9,10, 11,12,13,14,15,16,17,18,27,28,30,31,32)]
ggpairs(df_winter_00_08_small[,1:ncol(df_winter_00_08_small)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + ggtitle("Matrix of scatter plots and distributions of weather data")
Below we plot T2Obs against the observation index. We cannot infer much from the plot other than a seasonal cycle in the observed temperature at 2 meters, so we split the data into smaller chunks based on season and forecast period.
plot(T2_D2_Obs_model_data_with_timestamps$T2Obs, main = "Observed temperature at 2 meters", xlab = "Observations", ylab ="T2Obs")
Below we can see the weather data for all the seasons, where the analysis time is 00:00 UTC, for the one-day forecast.
df_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1),]
df_00_08$timestamp = as.Date(strptime(df_00_08$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_00_08$timestamp, df_00_08$T2Obs, xlab="year", ylab="T2Obs")
Below we can see the weather data for the winter season, where the analysis time is 00:00 UTC, for the one-day forecast.
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
df_winter_00_08$timestamp = as.Date(strptime(df_winter_00_08$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_winter_00_08$timestamp, df_winter_00_08$T2Obs, xlab="year", ylab="T2Obs")
This plot is interesting: we can see that the low temperatures do not vary linearly.
qqnorm(df_winter_00_08$T2Obs); qqline(df_winter_00_08$T2Obs)
We can observe that the T2 from the ECMWF model differs from T2Obs.
qqnorm(df_winter_00_08$T2); qqline(df_winter_00_08$T2)
# winter (season 1), 10-day forecast, analysis time 00:00 UTC
df_winter_00_65 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 65 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
df_winter_00_65$timestamp = as.Date(strptime(df_winter_00_65$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_winter_00_65$timestamp, df_winter_00_65$T2Obs, xlab="year", ylab="T2Obs")
# summer (season 3), 10-day forecast, analysis time 00:00 UTC
df_summer_00_65 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 65 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 3),]
df_summer_00_65$timestamp = as.Date(strptime(df_summer_00_65$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_summer_00_65$timestamp, df_summer_00_65$T2Obs, xlab="year", ylab="T2Obs")
### QQ plot of winter T2Obs, 10-day forecast
qqnorm(df_winter_00_65$T2Obs); qqline(df_winter_00_65$T2Obs)
qqnorm(df_winter_00_65$T2); qqline(df_winter_00_65$T2)
We also show similar graphs for the summer temperatures.
df_summer_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 3),]
df_summer_00_08$timestamp = as.Date(strptime(df_summer_00_08$timestamp, "%Y-%m-%d %H:%M:%S"))
qplot(df_summer_00_08$timestamp, df_summer_00_08$T2Obs, xlab="year", ylab="T2Obs")
qqnorm(df_summer_00_08$T2Obs); qqline(df_summer_00_08$T2Obs)
### QQ plot of summer ECMWF model T2
qqnorm(df_summer_00_08$T2); qqline(df_summer_00_08$T2)
Now we plot some graphs depicting the relationship between T2Obs and T2 from ECMWF for winter, for the one-day forecast. We can observe that the relationship is more linear at high temperatures than at low values of T2Obs and T2. We used the method lm to fit the regression line in this plot.
ggplot(df_winter_00_08, aes(x = T2Obs, y = T2)) + geom_point() + xlab("T2Obs") + ylab("ECMWF model T2") + ggtitle("T2 observed vs ECMWF model T2 for Winter") + geom_smooth(method = "lm")
A similar relationship can be observed between D2Obs and D2 from ECMWF for winter, for the one-day forecast.
ggplot(df_winter_00_08, aes(x = D2Obs, y = D2)) + geom_point() + xlab("D2Obs") + ylab("ECMWF model D2") + ggtitle("D2 observed vs ECMWF model D2 for Winter") + geom_smooth(method = "lm")
We can observe that for the 10-day forecast, the points of T2Obs and T2 from the ECMWF model are more scattered. A similar observation holds between D2Obs and D2 from the ECMWF model for the 10-day forecast.
ggplot(df_winter_00_65, aes(x = T2Obs, y = T2)) + geom_point() + xlab("T2Obs") + ylab("ECMWF model T2") + ggtitle("T2 observed vs ECMWF model T2 for Winter, 10-day forecast") + geom_smooth(method = "lm")
ggplot(df_winter_00_65, aes(x = D2Obs, y = D2)) + geom_point() + xlab("D2Obs") + ylab("ECMWF model D2") + ggtitle("D2 observed vs ECMWF model D2 for Winter, 10-day forecast") + geom_smooth(method = "lm")
#### T2Obs vs T2 from ECMWF model for Summer forecast period 08
The following plots show the relationship between T2Obs and T2 from the ECMWF model, and between D2Obs and D2, for the one-day and 10-day summer forecasts.
ggplot(df_summer_00_08, aes(x = T2Obs, y = T2)) + geom_point() + xlab("T2Obs") + ylab("ECMWF model T2") + ggtitle("T2 observed vs ECMWF model T2 for Summer") + geom_smooth(method = "lm")
ggplot(df_summer_00_08, aes(x = D2Obs, y = D2)) + geom_point() + xlab("D2Obs") + ylab("ECMWF model D2") + ggtitle("D2 observed vs ECMWF model D2 for Summer") + geom_smooth(method = "lm")
ggplot(df_summer_00_65, aes(x = T2Obs, y = T2)) + geom_point() + xlab("T2Obs") + ylab("ECMWF model T2") + ggtitle("T2 observed vs ECMWF model T2 for Summer, 10-day forecast") + geom_smooth(method = "lm")
ggplot(df_summer_00_65, aes(x = D2Obs, y = D2)) + geom_point() + xlab("D2Obs") + ylab("ECMWF model D2") + ggtitle("D2 observed vs ECMWF model D2 for Summer, 10-day forecast") + geom_smooth(method = "lm")
Now we plot quantiles of the difference between T2Obs and T2 from the ECMWF model for all the forecast periods. We can see that as the length of the forecast period increases, the variation in the quantiles increases for both high and low temperatures.
T2Obs_T2 <- T2_D2_Obs_model_data_with_timestamps$T2Obs - T2_D2_Obs_model_data_with_timestamps$T2
df_T2Obs_T2 <- cbind(T2_D2_Obs_model_data_with_timestamps, T2Obs_T2)
colnames(df_T2Obs_T2)[ncol(df_T2Obs_T2)] <- "T2Obs_T2"
ggplot(df_T2Obs_T2, aes(x = factor(forecast_period), y = T2Obs_T2)) + xlab("forecast_period (2-65)") + ylab("T2Obs - T2") + ggtitle("Difference between the observed temperature and the ECMWF model temperature") + geom_boxplot(width = 1, colour = "blue")
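The spread visible in the boxplot can also be summarized numerically with a bias and RMSE per forecast period. A minimal sketch of this, using simulated stand-in data with the same column names as our dataset (the real data would be grouped in exactly the same way):

```r
# Simulated stand-in for the assignment dataset: three forecast periods,
# with forecast error that grows with the period, as in the real data.
set.seed(42)
df <- data.frame(
  forecast_period = rep(c(8, 35, 65), each = 200),
  T2Obs           = rnorm(600, mean = 273, sd = 8)
)
df$T2  <- df$T2Obs + rnorm(600, sd = 0.5 + 0.03 * df$forecast_period)
df$err <- df$T2Obs - df$T2

# mean error (bias) and mean squared error per forecast period
err_summary <- aggregate(cbind(bias = err, mse = err^2) ~ forecast_period,
                         data = df, FUN = mean)
err_summary$rmse <- sqrt(err_summary$mse)
err_summary[, c("forecast_period", "bias", "rmse")]
```

As expected, the RMSE grows with the forecast period, mirroring the widening boxes in the plot above.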
In order to model T2 and D2, we use linear regression. Linear regression is a modelling technique in which we find a linear relationship between a scalar dependent variable, called the response or predictand, and one or more explanatory variables, called predictors.
Let us take a statistical dataset which consists of \((y, x_1, x_2, \ldots, x_n)\), where \(y\) is the response or predictand variable and \(x_1, x_2, \ldots, x_n\) are the predictors. We can express a linear relationship between the predictand and the predictors as
\(\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\), where \(\hat{y}\) is the approximation to the predictand \(y\).
In the linear regression model our aim is to find the \(\beta\) coefficients so that the squared error \((y - \hat{y})^2\) is minimized.
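The least-squares fit described above can be sketched directly in R. A toy example on simulated data (not the assignment dataset), checking that lm() agrees with the closed-form normal-equations solution \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\):

```r
# Simulate data from a known linear model y = 2 + 1.5*x1 - 0.5*x2 + noise
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

# Fit by least squares with lm()
fit <- lm(y ~ x1 + x2)

# Solve the normal equations by hand for comparison
X        <- cbind(1, x1, x2)                 # design matrix with intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y

stopifnot(all.equal(unname(coef(fit)), c(beta_hat), tolerance = 1e-6))
```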
In our assignment, we find the relationship between T2Obs and the variables given as output of the ECMWF numerical prediction model. A similar regression model for D2Obs and the ECMWF outputs is also fitted.
The weather data vary widely between the seasons, and the ECMWF forecast model depends strongly on the forecast period. For the exploratory analysis we therefore take only two datasets: the winter data and the summer data for the one-day forecast period. We predict T2 and D2 using our model on these data. In the later part of the assignment we use the model to predict T2 and D2 on the whole Kumpula weather station data, for 4 seasons and 64 forecast periods.
We load the data from the file “data/station2998_T2_D2_Obs_model_data_with_timestamps.csv” and select the data for the one-day forecast for the winter season, with analysis time 00:00 UTC. We then randomly split the data into training and test sets. The dataset is named df_winter_00_08.
The dataset df_winter_00_08 has 573 observations of 32 variables. The first five columns are station_id, season, analysis_time, forecast_period and timestamp, which we call collectively the station info. The 6th column is T2Obs and the 32nd column is D2Obs. Columns 7 to 31 contain the model data from the ECMWF numerical model.
# loading the data from the file
T2_D2_Obs_model_data_with_timestamps = read.table(file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv", sep = ",", header = TRUE)
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 & T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 & T2_D2_Obs_model_data_with_timestamps$season == 1),]
dim(df_winter_00_08)
## [1] 573 32
str(df_winter_00_08)
## 'data.frame': 573 obs. of 32 variables:
## $ station_id : int 2998 2998 2998 2998 2998 2998 2998 2998 2998 2998 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ analysis_time : int 1 1 1 1 1 1 1 1 1 1 ...
## $ forecast_period: int 8 8 8 8 8 8 8 8 8 8 ...
## $ timestamp : Factor w/ 11184 levels "2011-12-01 03:00:00",..: 7 15 23 31 39 47 55 63 71 79 ...
## $ T2Obs : num 279 276 278 277 277 ...
## $ MSL : num 100384 99098 98962 97654 97876 ...
## $ T2 : num 278 276 277 276 275 ...
## $ D2 : num 276 275 275 275 271 ...
## $ U10 : num 3 3.7 0.5 7 4.6 3.7 1.1 2.8 -2.3 6 ...
## $ V10 : num 10.2 0.6 11.1 8 2.6 0.8 7.9 4 10.6 2.1 ...
## $ TP : num 0.03 0.2 0.07 0.1 0 0 0.47 0.17 0.5 0.17 ...
## $ LCC : num 0.05 0.02 0.86 0.39 0.12 0.15 0.51 0.55 0.62 0.8 ...
## $ MCC : num 0.99 1 0.88 0.08 0.27 0.98 0.91 0.4 0.69 0.71 ...
## $ HCC : num 1 0.95 0.92 0.76 0 0.98 0.97 0 0.32 0.59 ...
## $ SKT : num 278 276 277 277 275 ...
## $ RH_950 : num 87.2 84.7 95.1 94 57.2 67.2 98.1 92.6 97.4 95.9 ...
## $ T_950 : num 275 274 274 274 275 ...
## $ RH_925 : num 72.2 82.2 99.7 93.4 56.1 ...
## $ T_925 : num 274 273 272 273 274 ...
## $ RH_850 : num 37.8 63 93.7 84.2 72.5 ...
## $ T_850 : num 273 268 269 269 267 ...
## $ RH_700 : num 82.7 77.2 76.2 60.7 26.6 19.8 61 19.9 97.1 84.5 ...
## $ T_700 : num 262 258 260 258 257 ...
## $ RH_500 : num 38.1 99.4 82 72.6 43.6 ...
## $ T_500 : num 248 242 246 238 236 ...
## $ T2_M1 : num 279 277 278 276 275 ...
## $ T_950_M1 : num 275 274 275 274 273 ...
## $ T_925_M1 : num 275 273 274 273 272 ...
## $ DECLINATION : num -13.7 -14.1 -14.4 -14.7 -15 ...
## $ D2_M1 : num 276 275 275 276 274 ...
## $ D2Obs : num 275 274 274 275 274 ...
From df_winter_00_08 we create two datasets: one with T2Obs and the model data, the other with D2Obs and the model data. The model data are the same in both, and the response variables T2Obs and D2Obs are in the first column of the T2Obs dataset and the D2Obs dataset respectively. We then split the data into training and test sets: the training set has 430 observations of 26 variables and the test set has 143 observations of 26 variables. Notice that the station info is removed, as its variables do not influence the linear regression model.
# Splitting into train and test data
set.seed(123)
n <- nrow(df_winter_00_08)
train = sample(1:n, size = round(0.75*n), replace=FALSE)
T2_winter_00_08_train <- df_winter_00_08[train,6:(ncol(df_winter_00_08) - 1)]
T2_winter_00_08_test <- df_winter_00_08[-train,6:(ncol(df_winter_00_08) - 1)]
D2_winter_00_08_train <- cbind(df_winter_00_08[train,ncol(df_winter_00_08)],
df_winter_00_08[train,7:(ncol(df_winter_00_08) - 1)])
colnames(D2_winter_00_08_train)[1] <- "D2Obs"
D2_winter_00_08_test <- cbind(df_winter_00_08[-train,ncol(df_winter_00_08)],
df_winter_00_08[-train,7:(ncol(df_winter_00_08) - 1)])
colnames(D2_winter_00_08_test)[1] <- "D2Obs"
dim(T2_winter_00_08_train)
## [1] 430 26
dim(T2_winter_00_08_test)
## [1] 143 26
dim(D2_winter_00_08_train)
## [1] 430 26
dim(D2_winter_00_08_test)
## [1] 143 26
str(T2_winter_00_08_train)
## 'data.frame': 430 obs. of 26 variables:
## $ T2Obs : num 270 274 266 264 274 ...
## $ MSL : num 102001 103628 102918 102871 99973 ...
## $ T2 : num 271 274 267 269 274 ...
## $ D2 : num 268 272 264 266 272 ...
## $ U10 : num -2.6 2.6 -1.2 -6.2 4.7 3.7 6.2 2.7 4.8 1.5 ...
## $ V10 : num 6.1 0.5 -3 1.1 5.3 1.4 0.5 3.5 1.9 -3 ...
## $ TP : num 0.1 0 0 0.03 0.2 0 0 0.17 0.1 0 ...
## $ LCC : num 0.99 0.69 1 0.93 0.58 0 0 0.52 1 0 ...
## $ MCC : num 0.74 0.94 0 0.55 0.59 0.78 0.47 0.29 0.47 0 ...
## $ HCC : num 0 0 0 0.94 0 0 0 0 0 0 ...
## $ SKT : num 272 274 266 270 273 ...
## $ RH_950 : num 94.8 95.9 95.3 100.7 99 ...
## $ T_950 : num 266 270 263 264 272 ...
## $ RH_925 : num 100.5 95.7 98 73.8 100.6 ...
## $ T_925 : num 264 269 261 265 270 ...
## $ RH_850 : num 98.8 90.8 63.2 23.5 91.9 36.5 26.9 83.3 66.8 32.7 ...
## $ T_850 : num 261 266 261 266 267 ...
## $ RH_700 : num 104.4 54 76.6 12.8 90.1 ...
## $ T_700 : num 254 259 256 258 256 ...
## $ RH_500 : num 38 23 26.8 96 64.5 85.4 5.3 19 90.9 35.9 ...
## $ T_500 : num 238 244 241 242 236 ...
## $ T2_M1 : num 270 274 267 269 274 ...
## $ T_950_M1 : num 265 271 262 264 272 ...
## $ T_925_M1 : num 263 269 260 264 270 ...
## $ DECLINATION: num -17.7 -12.7 -20.4 -23.4 -19.9 ...
## $ D2_M1 : num 266 272 265 266 273 ...
str(D2_winter_00_08_train)
## 'data.frame': 430 obs. of 26 variables:
## $ D2Obs : num 266 273 265 262 274 ...
## $ MSL : num 102001 103628 102918 102871 99973 ...
## $ T2 : num 271 274 267 269 274 ...
## $ D2 : num 268 272 264 266 272 ...
## $ U10 : num -2.6 2.6 -1.2 -6.2 4.7 3.7 6.2 2.7 4.8 1.5 ...
## $ V10 : num 6.1 0.5 -3 1.1 5.3 1.4 0.5 3.5 1.9 -3 ...
## $ TP : num 0.1 0 0 0.03 0.2 0 0 0.17 0.1 0 ...
## $ LCC : num 0.99 0.69 1 0.93 0.58 0 0 0.52 1 0 ...
## $ MCC : num 0.74 0.94 0 0.55 0.59 0.78 0.47 0.29 0.47 0 ...
## $ HCC : num 0 0 0 0.94 0 0 0 0 0 0 ...
## $ SKT : num 272 274 266 270 273 ...
## $ RH_950 : num 94.8 95.9 95.3 100.7 99 ...
## $ T_950 : num 266 270 263 264 272 ...
## $ RH_925 : num 100.5 95.7 98 73.8 100.6 ...
## $ T_925 : num 264 269 261 265 270 ...
## $ RH_850 : num 98.8 90.8 63.2 23.5 91.9 36.5 26.9 83.3 66.8 32.7 ...
## $ T_850 : num 261 266 261 266 267 ...
## $ RH_700 : num 104.4 54 76.6 12.8 90.1 ...
## $ T_700 : num 254 259 256 258 256 ...
## $ RH_500 : num 38 23 26.8 96 64.5 85.4 5.3 19 90.9 35.9 ...
## $ T_500 : num 238 244 241 242 236 ...
## $ T2_M1 : num 270 274 267 269 274 ...
## $ T_950_M1 : num 265 271 262 264 272 ...
## $ T_925_M1 : num 263 269 260 264 270 ...
## $ DECLINATION: num -17.7 -12.7 -20.4 -23.4 -19.9 ...
## $ D2_M1 : num 266 272 265 266 273 ...
Here we compute the correlation matrix for the variables of the dataset df_winter_00_08.
library(corrplot); library(dplyr)
cor(as.matrix(df_winter_00_08[,6:ncol(df_winter_00_08)]))
## T2Obs MSL T2 D2 U10
## T2Obs 1.00000000 -0.413811966 0.94819952 0.9283360 0.39888861
## MSL -0.41381197 1.000000000 -0.42317943 -0.4285554 -0.29288672
## T2 0.94819952 -0.423179432 1.00000000 0.9637727 0.34185343
## D2 0.92833604 -0.428555359 0.96377269 1.0000000 0.34243224
## U10 0.39888861 -0.292886725 0.34185343 0.3424322 1.00000000
## V10 0.40988014 -0.139920196 0.43500360 0.4683110 0.15677157
## TP 0.12437463 -0.310715188 0.19148969 0.2329119 -0.17057429
## LCC -0.03288617 -0.072495229 0.09124882 0.1843243 -0.24070874
## MCC 0.14525132 -0.307142732 0.20605611 0.2101621 -0.12633130
## HCC 0.23019018 -0.239862584 0.26314762 0.2699033 -0.07444964
## SKT 0.88635334 -0.416393861 0.96003024 0.9324701 0.26090652
## RH_950 0.18680122 -0.332949329 0.27585845 0.3713458 -0.05612313
## T_950 0.88894005 -0.400354657 0.86378443 0.8619767 0.47067278
## RH_925 0.21611920 -0.405869358 0.30219819 0.3676704 -0.05547312
## T_925 0.85305415 -0.350232568 0.82765673 0.8354953 0.45719207
## RH_850 0.16796347 -0.399022913 0.22836550 0.2647076 -0.06964610
## T_850 0.76939908 -0.218716663 0.76410634 0.7756818 0.35151703
## RH_700 0.04118724 -0.293594088 0.09996808 0.1134363 -0.07876769
## T_700 0.67602135 -0.070510714 0.68220320 0.6821273 0.21442493
## RH_500 0.09403511 -0.245685570 0.13023430 0.1367390 -0.07298243
## T_500 0.58733485 0.004315872 0.59148301 0.5743458 0.17941813
## T2_M1 0.94419536 -0.427028839 0.97794187 0.9393936 0.35542012
## T_950_M1 0.88398440 -0.411611039 0.85505911 0.8462214 0.46579184
## T_925_M1 0.85172114 -0.366731359 0.82271175 0.8228920 0.46300772
## DECLINATION 0.45237252 -0.029596156 0.44224781 0.3801482 0.08407667
## D2_M1 0.92448408 -0.440432987 0.95124248 0.9799598 0.35241699
## D2Obs 0.94193317 -0.455461577 0.93054344 0.9664760 0.38240272
## V10 TP LCC MCC HCC
## T2Obs 0.40988014 0.12437463 -0.03288617 0.145251318 0.23019018
## MSL -0.13992020 -0.31071519 -0.07249523 -0.307142732 -0.23986258
## T2 0.43500360 0.19148969 0.09124882 0.206056113 0.26314762
## D2 0.46831103 0.23291194 0.18432430 0.210162104 0.26990327
## U10 0.15677157 -0.17057429 -0.24070874 -0.126331297 -0.07444964
## V10 1.00000000 0.20778973 0.24391282 0.236058803 0.32569156
## TP 0.20778973 1.00000000 0.31979717 0.458878396 0.24949343
## LCC 0.24391282 0.31979717 1.00000000 0.254180818 0.16403613
## MCC 0.23605880 0.45887840 0.25418082 1.000000000 0.48023157
## HCC 0.32569156 0.24949343 0.16403613 0.480231570 1.00000000
## SKT 0.42663298 0.23951905 0.22575965 0.244730758 0.27416486
## RH_950 0.30877422 0.31945589 0.69139249 0.241499869 0.19327126
## T_950 0.30829153 0.10720762 -0.16829073 0.094443642 0.19841297
## RH_925 0.26975527 0.35831048 0.65505308 0.297044509 0.21018173
## T_925 0.31259035 0.09297324 -0.16427212 0.068320297 0.19325203
## RH_850 0.16899532 0.43744112 0.49746405 0.467002703 0.16251084
## T_850 0.39728664 0.08222362 -0.03616471 0.007852306 0.24965827
## RH_700 0.10825689 0.41840813 0.29728990 0.608493452 0.22022954
## T_700 0.41755364 0.09861702 0.05299987 0.036553298 0.28319161
## RH_500 0.17528982 0.28480045 0.13914094 0.616356463 0.51377097
## T_500 0.36518165 0.08031987 0.01984444 0.065538841 0.32310207
## T2_M1 0.36028397 0.16101713 0.02152608 0.167021367 0.23265653
## T_950_M1 0.24745279 0.08428835 -0.18212499 0.077195419 0.17638084
## T_925_M1 0.26073294 0.07149240 -0.18056386 0.052828052 0.17352946
## DECLINATION -0.03213909 0.03771255 -0.21109628 -0.021729648 0.05668091
## D2_M1 0.40001171 0.19830101 0.13464191 0.181090800 0.24352796
## D2Obs 0.47678192 0.20345976 0.14652865 0.195814995 0.25890049
## SKT RH_950 T_950 RH_925 T_925
## T2Obs 0.8863533 0.18680122 0.88894005 0.216119199 0.853054151
## MSL -0.4163939 -0.33294933 -0.40035466 -0.405869358 -0.350232568
## T2 0.9600302 0.27585845 0.86378443 0.302198191 0.827656727
## D2 0.9324701 0.37134581 0.86197674 0.367670428 0.835495250
## U10 0.2609065 -0.05612313 0.47067278 -0.055473121 0.457192073
## V10 0.4266330 0.30877422 0.30829153 0.269755270 0.312590355
## TP 0.2395191 0.31945589 0.10720762 0.358310478 0.092973241
## LCC 0.2257597 0.69139249 -0.16829073 0.655053077 -0.164272121
## MCC 0.2447308 0.24149987 0.09444364 0.297044509 0.068320297
## HCC 0.2741649 0.19327126 0.19841297 0.210181730 0.193252028
## SKT 1.0000000 0.38351744 0.76362561 0.381873995 0.734760823
## RH_950 0.3835174 1.00000000 -0.04652156 0.906652360 -0.071677023
## T_950 0.7636256 -0.04652156 1.00000000 0.017985727 0.977395344
## RH_925 0.3818740 0.90665236 0.01798573 1.000000000 -0.064795061
## T_925 0.7347608 -0.07167702 0.97739534 -0.064795061 1.000000000
## RH_850 0.2783907 0.55757484 0.05592327 0.698828040 -0.001843391
## T_850 0.7021015 0.03952116 0.84295487 0.001892902 0.890126622
## RH_700 0.1633512 0.34207418 -0.02901140 0.399889400 -0.061724744
## T_700 0.6400933 0.07562355 0.70508318 0.048522955 0.747717931
## RH_500 0.1528910 0.15485791 0.08770309 0.171523683 0.077657062
## T_500 0.5513347 0.02007681 0.60901726 0.004119243 0.641986519
## T2_M1 0.9371965 0.22590702 0.86236404 0.248814962 0.827904282
## T_950_M1 0.7587190 -0.05353429 0.98032614 0.007487770 0.956520188
## T_925_M1 0.7308665 -0.07722763 0.96649309 -0.062380455 0.979534164
## DECLINATION 0.4122035 -0.09919816 0.44368980 -0.065279659 0.417626106
## D2_M1 0.9224356 0.33473479 0.85860018 0.330967322 0.834474987
## D2Obs 0.8969396 0.36156717 0.85307293 0.361084178 0.826701929
## RH_850 T_850 RH_700 T_700 RH_500
## T2Obs 0.167963469 0.769399085 0.04118724 0.67602135 0.09403511
## MSL -0.399022913 -0.218716663 -0.29359409 -0.07051071 -0.24568557
## T2 0.228365501 0.764106337 0.09996808 0.68220320 0.13023430
## D2 0.264707565 0.775681835 0.11343631 0.68212734 0.13673898
## U10 -0.069646096 0.351517029 -0.07876769 0.21442493 -0.07298243
## V10 0.168995321 0.397286638 0.10825689 0.41755364 0.17528982
## TP 0.437441125 0.082223619 0.41840813 0.09861702 0.28480045
## LCC 0.497464051 -0.036164710 0.29728990 0.05299987 0.13914094
## MCC 0.467002703 0.007852306 0.60849345 0.03655330 0.61635646
## HCC 0.162510836 0.249658274 0.22022954 0.28319161 0.51377097
## SKT 0.278390678 0.702101545 0.16335123 0.64009330 0.15289102
## RH_950 0.557574843 0.039521157 0.34207418 0.07562355 0.15485791
## T_950 0.055923274 0.842954866 -0.02901140 0.70508318 0.08770309
## RH_925 0.698828040 0.001892902 0.39988940 0.04852296 0.17152368
## T_925 -0.001843391 0.890126622 -0.06172474 0.74771793 0.07765706
## RH_850 1.000000000 -0.118710690 0.59675402 -0.05735586 0.24044789
## T_850 -0.118710690 1.000000000 -0.10183030 0.89325708 0.04000299
## RH_700 0.596754017 -0.101830300 1.00000000 -0.11862092 0.41038018
## T_700 -0.057355860 0.893257084 -0.11862092 1.00000000 0.05992118
## RH_500 0.240447893 0.040002992 0.41038018 0.05992118 1.00000000
## T_500 -0.075711362 0.770578774 -0.09553026 0.89912143 0.03878061
## T2_M1 0.180628282 0.749989737 0.06321443 0.65443338 0.10908679
## T_950_M1 0.029034251 0.819861647 -0.03979655 0.67013543 0.07289999
## T_925_M1 -0.032145229 0.871372768 -0.07769362 0.72071438 0.06414881
## DECLINATION -0.013263616 0.338130424 -0.07407492 0.32886792 -0.01122715
## D2_M1 0.226791008 0.765859911 0.08582535 0.66284002 0.11957414
## D2Obs 0.268052174 0.765794084 0.10873350 0.67134133 0.13140822
## T_500 T2_M1 T_950_M1 T_925_M1 DECLINATION
## T2Obs 0.587334852 0.94419536 0.88398440 0.85172114 0.45237252
## MSL 0.004315872 -0.42702884 -0.41161104 -0.36673136 -0.02959616
## T2 0.591483005 0.97794187 0.85505911 0.82271175 0.44224781
## D2 0.574345804 0.93939358 0.84622140 0.82289201 0.38014817
## U10 0.179418126 0.35542012 0.46579184 0.46300772 0.08407667
## V10 0.365181649 0.36028397 0.24745279 0.26073294 -0.03213909
## TP 0.080319866 0.16101713 0.08428835 0.07149240 0.03771255
## LCC 0.019844435 0.02152608 -0.18212499 -0.18056386 -0.21109628
## MCC 0.065538841 0.16702137 0.07719542 0.05282805 -0.02172965
## HCC 0.323102075 0.23265653 0.17638084 0.17352946 0.05668091
## SKT 0.551334698 0.93719654 0.75871903 0.73086652 0.41220355
## RH_950 0.020076813 0.22590702 -0.05353429 -0.07722763 -0.09919816
## T_950 0.609017256 0.86236404 0.98032614 0.96649309 0.44368980
## RH_925 0.004119243 0.24881496 0.00748777 -0.06238045 -0.06527966
## T_925 0.641986519 0.82790428 0.95652019 0.97953416 0.41762611
## RH_850 -0.075711362 0.18062828 0.02903425 -0.03214523 -0.01326362
## T_850 0.770578774 0.74998974 0.81986165 0.87137277 0.33813042
## RH_700 -0.095530263 0.06321443 -0.03979655 -0.07769362 -0.07407492
## T_700 0.899121431 0.65443338 0.67013543 0.72071438 0.32886792
## RH_500 0.038780606 0.10908679 0.07289999 0.06414881 -0.01122715
## T_500 1.000000000 0.56299998 0.57860779 0.61653798 0.33501718
## T2_M1 0.562999978 1.00000000 0.87358417 0.83523137 0.46319083
## T_950_M1 0.578607789 0.87358417 1.00000000 0.97512160 0.45672756
## T_925_M1 0.616537983 0.83523137 0.97512160 1.00000000 0.42995698
## DECLINATION 0.335017182 0.46319083 0.45672756 0.42995698 1.00000000
## D2_M1 0.552458828 0.95904366 0.86190071 0.83612281 0.38729962
## D2Obs 0.564374564 0.90946289 0.83419344 0.81248000 0.33219331
## D2_M1 D2Obs
## T2Obs 0.92448408 0.9419332
## MSL -0.44043299 -0.4554616
## T2 0.95124248 0.9305434
## D2 0.97995979 0.9664760
## U10 0.35241699 0.3824027
## V10 0.40001171 0.4767819
## TP 0.19830101 0.2034598
## LCC 0.13464191 0.1465286
## MCC 0.18109080 0.1958150
## HCC 0.24352796 0.2589005
## SKT 0.92243558 0.8969396
## RH_950 0.33473479 0.3615672
## T_950 0.85860018 0.8530729
## RH_925 0.33096732 0.3610842
## T_925 0.83447499 0.8267019
## RH_850 0.22679101 0.2680522
## T_850 0.76585991 0.7657941
## RH_700 0.08582535 0.1087335
## T_700 0.66284002 0.6713413
## RH_500 0.11957414 0.1314082
## T_500 0.55245883 0.5643746
## T2_M1 0.95904366 0.9094629
## T_950_M1 0.86190071 0.8341934
## T_925_M1 0.83612281 0.8124800
## DECLINATION 0.38729962 0.3321933
## D2_M1 1.00000000 0.9529205
## D2Obs 0.95292046 1.0000000
We use the correlation matrix and the scatter plots to select the predictor variables for the model that predicts T2.
We study and plot the relationship between T2Obs and the ECMWF model variables. The correlation coefficients of the variables of the dataset df_winter_00_08 with T2Obs are given below.
##                   T2Obs          MSL         T2        D2        U10        V10         TP         LCC
## T2Obs        1.00000000 -0.413811966 0.94819952 0.9283360 0.39888861 0.40988014 0.12437463 -0.03288617
##                     MCC        HCC        SKT     RH_950      T_950      RH_925       T_925      RH_850
## T2Obs       0.145251318 0.23019018  0.8863533 0.18680122 0.88894005 0.216119199 0.853054151 0.167963469
##                   T_850     RH_700      T_700     RH_500       T_500      T2_M1   T_950_M1    T_925_M1
## T2Obs       0.769399085 0.04118724 0.67602135 0.09403511 0.587334852 0.94419536 0.88398440  0.85172114
##             DECLINATION      D2_M1     D2Obs
## T2Obs        0.45237252 0.92448408 0.9419332
From the scatter plots we can observe that all the “temperature variables” are linearly correlated with T2Obs. Please zoom into the figure to see the details.
library (ggplot2)
par(mfrow=c(4,7))
for (i in 7:31) {
plot(df_winter_00_08[,i],df_winter_00_08[,6], ylab = "T2Obs", xlab = colnames(df_winter_00_08[i]))
}
For our first model, we take all the variables that have a positive correlation higher than 0.8 with T2Obs, plus all the variables that are negatively correlated with T2Obs.
All the “temperature variables” are linearly correlated with T2Obs, but some have correlation coefficients below 0.8. We take all the temperature variables whose correlation coefficient with T2Obs is higher than 0.8, so we use T2, T_950, T_925, T2_M1, T_950_M1, T_925_M1 and SKT in our model. D2 and D2_M1 also have high correlation coefficients with T2Obs and vary linearly with it, so we include them in our model.
Even though the RH variables appear randomly scattered against T2Obs, their correlations are below 0.8, so we exclude them from our model. DECLINATION is linearly correlated with T2Obs, but its correlation coefficient is below 0.5, so we exclude it as well. TP is not linearly correlated with T2Obs and has a very small correlation coefficient, so we exclude it too.
From the scatter plots we can see that the cloud cover variables LCC, MCC and HCC are randomly scattered against T2Obs and have low correlation coefficients with it. As LCC has a negative correlation with T2Obs, we add it to our list of predictors, and as MSL also has a negative correlation coefficient, we include it as well. The wind velocity variables U10 and V10 are randomly scattered and have low correlation coefficients, so we exclude them from our model.
Based on the above reasoning, the formula for our initial model is **T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL**.
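The screening rule above (keep predictors with correlation above 0.8 or a negative correlation with the response) could also be applied programmatically. A sketch on simulated data; with the real data, the correlations would come from cor(df_winter_00_08[, 6:ncol(df_winter_00_08)])["T2Obs", ]:

```r
# Simulated stand-in: one strong, one negative and one weak predictor
set.seed(7)
d <- data.frame(T2Obs = rnorm(100))
d$T2  <-  d$T2Obs + rnorm(100, sd = 0.2)        # strongly positively correlated
d$MSL <- -d$T2Obs + rnorm(100, sd = 1)          # negatively correlated
d$RH  <-  0.3 * d$T2Obs + rnorm(100)            # weakly correlated

# correlations of every predictor with the response
cors <- cor(d)["T2Obs", setdiff(colnames(d), "T2Obs")]

# apply the screening rule and build the model formula
keep <- names(cors)[cors > 0.8 | cors < 0]
form <- reformulate(keep, response = "T2Obs")
form
```

This selects T2 (high positive correlation) and MSL (negative correlation) and drops the weakly correlated RH, matching the manual reasoning used for the real dataset.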
We use the R function “lm” to fit the linear regression model.
The summary of our first model, my.model1.T2, shows a residual standard error of 1.428, an F-statistic of 506.9 and a p-value < 2.2e-16. The p-value gives the probability of obtaining data at least as extreme as ours if the null hypothesis were true. Under the null hypothesis the model is an intercept-only model and there is no linear relationship between the predictand and the predictors. As our p-value is very small, we reject the null hypothesis: there is a linear relationship between the predictand and the predictors.
The Fisher value or F-statistic is the ratio of explained to unexplained variance and has a value near 1 if the null hypothesis is true. It is defined as F-statistic = MSM (Mean of Squares for Model) / MSE (Mean of Squares for Error). In our case the F-statistic is 506.9, so we reject the null hypothesis.
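As a concrete illustration of this ratio, here is a standalone Python sketch (the sums of squares are hypothetical toy numbers, not values from this report) that computes F from the model and error sums of squares and their degrees of freedom:

```python
# Illustrative F-statistic computation; ssm and sse below are made-up numbers.
def f_statistic(ssm, sse, p, n):
    """F = MSM / MSE, for p predictors and n observations."""
    msm = ssm / p            # mean square for model (explained variance)
    mse = sse / (n - p - 1)  # mean square for error (unexplained variance)
    return msm / mse

# Toy example sized like the first model: 11 predictors, 430 observations.
print(round(f_statistic(ssm=1000.0, sse=85.0, p=11, n=430), 1))
```

A value far above 1 means the explained variance dominates the unexplained variance, which is why such a model rejects the null hypothesis.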
The t value is the value of the t-statistic for testing whether the corresponding regression coefficient differs from 0. As a rule of thumb, an absolute t value of about 2 or greater is good.
The Pr value is the p-value of the hypothesis test for which the t value is the test statistic. It gives the probability of a test statistic at least as extreme as the t value obtained, if the null hypothesis were true. A low Pr value for a variable indicates a linear relationship between that variable and the predictand. The significance codes order the variables by p-value: the lower the p-value, the higher the significance.
Based on the F-statistic and p value we see that our model is a good one.
my.model1.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL , data = T2_winter_00_08_train)
summary(my.model1.T2)
##
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 +
## T_925_M1 + D2_M1 + SKT + LCC + MSL, data = T2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3744 -0.5531 0.1222 0.7663 4.5287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.695e+00 8.465e+00 0.909 0.36387
## T2 2.712e-01 1.182e-01 2.294 0.02231 *
## D2 4.895e-01 1.016e-01 4.820 2.02e-06 ***
## T_950 1.229e-01 1.264e-01 0.973 0.33129
## T_925 -1.227e-01 1.084e-01 -1.131 0.25851
## T2_M1 2.856e-01 1.061e-01 2.692 0.00738 **
## T_950_M1 5.766e-02 1.175e-01 0.491 0.62388
## T_925_M1 6.014e-02 1.087e-01 0.553 0.58044
## D2_M1 -1.460e-01 9.609e-02 -1.520 0.12932
## SKT -4.030e-02 5.794e-02 -0.695 0.48716
## LCC -1.369e+00 2.582e-01 -5.304 1.84e-07 ***
## MSL 4.347e-06 5.328e-05 0.082 0.93501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.428 on 418 degrees of freedom
## Multiple R-squared: 0.9303, Adjusted R-squared: 0.9284
## F-statistic: 506.9 on 11 and 418 DF, p-value: < 2.2e-16
We now try to see whether we can improve our first model. Based on the t values and Pr values obtained in the first model, we keep only the variables with a Pr value less than 0.5. This results in a new formula for the linear regression:
** T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC **
The summary of the model shows that the F-statistic has increased from 506.9 to 698.6. The p-value remains the same low value, and the residual standard error is almost unchanged (1.428 for the first model, 1.426 for the second).
my.model2.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model2.T2)
##
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 +
## SKT + LCC, data = T2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2392 -0.5785 0.1247 0.8006 4.6234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.07884 4.49191 2.021 0.0439 *
## T2 0.26398 0.11688 2.259 0.0244 *
## D2 0.45513 0.09786 4.651 4.43e-06 ***
## T_950 0.18812 0.08061 2.334 0.0201 *
## T_925 -0.07796 0.06756 -1.154 0.2492
## T2_M1 0.30452 0.10092 3.017 0.0027 **
## D2_M1 -0.11617 0.09282 -1.252 0.2114
## SKT -0.04323 0.05760 -0.750 0.4534
## LCC -1.35913 0.25611 -5.307 1.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 421 degrees of freedom
## Multiple R-squared: 0.9299, Adjusted R-squared: 0.9286
## F-statistic: 698.6 on 8 and 421 DF, p-value: < 2.2e-16
We try to see if we can improve the model still further. Based on the t values and Pr values obtained in our second model, we keep only the variables with a Pr value less than 0.1. This results in a new formula for the linear regression:
** T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC **
The summary of the model shows that the F-statistic has increased further, to 1114. The p-value remains the same low value, and the residual standard error is the same as in the first model (1.428).
my.model3.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC , data = T2_winter_00_08_train)
summary(my.model3.T2)
##
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC, data = T2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4897 -0.6502 0.1423 0.7904 4.4750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.05288 4.24875 1.895 0.05873 .
## T2 0.32245 0.08019 4.021 6.85e-05 ***
## D2 0.34431 0.05699 6.042 3.34e-09 ***
## T_950 0.11709 0.03781 3.097 0.00209 **
## T2_M1 0.19490 0.06446 3.024 0.00265 **
## LCC -1.43573 0.24820 -5.785 1.41e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.428 on 424 degrees of freedom
## Multiple R-squared: 0.9293, Adjusted R-squared: 0.9284
## F-statistic: 1114 on 5 and 424 DF, p-value: < 2.2e-16
From the above analysis we see that we do not gain much in terms of residual error or p-value when we move from model 1 to model 3. Because our analysis is based only on the winter data, we cannot strongly state that the five variables of the third model are the only ones that significantly affect T2. As a compromise we select the second model with eight variables, which has a higher F value than the first model. The linear regression formula for our model is taken as
** T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC **
my.model.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model.T2)
##
## Call:
## lm(formula = T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 +
## SKT + LCC, data = T2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2392 -0.5785 0.1247 0.8006 4.6234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.07884 4.49191 2.021 0.0439 *
## T2 0.26398 0.11688 2.259 0.0244 *
## D2 0.45513 0.09786 4.651 4.43e-06 ***
## T_950 0.18812 0.08061 2.334 0.0201 *
## T_925 -0.07796 0.06756 -1.154 0.2492
## T2_M1 0.30452 0.10092 3.017 0.0027 **
## D2_M1 -0.11617 0.09282 -1.252 0.2114
## SKT -0.04323 0.05760 -0.750 0.4534
## LCC -1.35913 0.25611 -5.307 1.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 421 degrees of freedom
## Multiple R-squared: 0.9299, Adjusted R-squared: 0.9286
## F-statistic: 698.6 on 8 and 421 DF, p-value: < 2.2e-16
The fitted regression equation for the predicted T2 is: T2 = 9.07883695 + 0.26398396 * T2 (of the ECMWF model) + 0.45513159 * D2 (of the ECMWF model) + 0.18812205 * T_950 - 0.07796135 * T_925 + 0.30452258 * T2_M1 - 0.11617459 * D2_M1 - 0.04322546 * SKT - 1.35912603 * LCC
# calculating the fit
fitted.values <- predict(object = my.model.T2, newdata = T2_winter_00_08_test)
T2pred.mymodel <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- my.model.T2$coefficients
rmse.mymodel.T2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) T2 D2 T_950 T_925 T2_M1
## 9.07883695 0.26398396 0.45513159 0.18812205 -0.07796135 0.30452258
## D2_M1 SKT LCC
## -0.11617459 -0.04322546 -1.35912603
rmse.mymodel.T2
## [1] 1.462304
We use the correlation matrix and the scatter plots to find out the predictor variables for our model for D2.
We study and plot the relationship between D2Obs and the ECMWF model variables. The correlation coefficients are:
T2Obs MSL T2 D2 U10 V10 TP LCC
D2Obs 0.94193317 -0.455461577 0.93054344 0.9664760 0.38240272 0.47678192 0.20345976 0.14652865
MCC HCC SKT RH_950 T_950 RH_925 T_925 RH_850
D2Obs 0.195814995 0.25890049 0.8969396 0.36156717 0.85307293 0.361084178 0.826701929 0.268052174
T_850 RH_700 T_700 RH_500 T_500 T2_M1 T_950_M1 T_925_M1
D2Obs 0.765794084 0.10873350 0.67134133 0.13140822 0.564374564 0.90946289 0.83419344 0.81248000
DECLINATION D2_M1 D2Obs
D2Obs 0.33219331 0.95292046 1.0000000
We can observe from the scatter plots that all “temperature variables” are linearly correlated with D2Obs. Please zoom the figure to see the details.
library (ggplot2)
par(mfrow=c(4,7))
for (i in 7:31) {
plot(df_winter_00_08[,i],df_winter_00_08[,32], ylab = "D2Obs", xlab = colnames(df_winter_00_08[i]))
}
Similar to the modelling of T2Obs, for our first model, we take all the variables which have a positive correlation higher than 0.8 with D2Obs and all the variables which are negatively correlated with D2Obs.
All the “temperature variables” are linearly correlated with D2Obs, but some have correlation coefficients below 0.8. We take all the temperature variables whose correlation coefficients with D2Obs are higher than 0.8, so we use T2, T_950, T_925, T2_M1, T_950_M1, T_925_M1 and SKT in our model. D2 and D2_M1 also have high correlation coefficients with D2Obs and vary linearly with it, so we include both in our model.
The RH variables appear randomly distributed with respect to D2Obs and have correlations below 0.8, so we exclude them from our model. Although DECLINATION is linearly correlated with D2Obs, its correlation coefficient is below 0.5, so we exclude it as well.
The cloud cover variables LCC, MCC and HCC are randomly distributed with respect to D2Obs and have low correlation coefficients with it, so we exclude them from our model. As MSL has a negative correlation coefficient, we include it in our list of predictors. The wind velocity variables, U10 and V10, are also randomly distributed and have low correlation coefficients.
From physical intuition, however, we suspect that D2 may be related to LCC, TP, RH_950 and V10, even though these are not linearly correlated with D2Obs and have very small correlation coefficients, so we add them to the model.
Based on the above hypothesis, the formula for our initial model is
** D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC**
The summary of our first model, my_model1_D2, shows a residual standard error of 1.447, an F-statistic of 486.7 and a p-value < 2.2e-16. As the p-value is very small, we reject the null hypothesis: there is a linear relationship between the predictand and the predictors.
my_model1_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model1_D2)
##
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 +
## T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC,
## data = D2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9569 -0.7827 0.0145 0.8490 5.1294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.160e+01 9.767e+00 1.188 0.235631
## T2 -4.629e-02 1.216e-01 -0.381 0.703674
## D2 5.465e-01 1.133e-01 4.824 1.98e-06 ***
## T_950 3.214e-01 1.297e-01 2.478 0.013608 *
## T_925 3.868e-03 1.104e-01 0.035 0.972074
## T2_M1 -1.111e-01 1.079e-01 -1.029 0.304000
## T_950_M1 -1.584e-01 1.201e-01 -1.319 0.187916
## T_925_M1 4.409e-02 1.102e-01 0.400 0.689370
## D2_M1 3.059e-01 9.887e-02 3.094 0.002109 **
## SKT 7.233e-02 5.931e-02 1.219 0.223357
## MSL -7.530e-05 6.167e-05 -1.221 0.222782
## TP -7.737e-01 4.781e-01 -1.618 0.106377
## RH_950 3.217e-02 7.219e-03 4.456 1.07e-05 ***
## V10 8.348e-02 2.304e-02 3.622 0.000328 ***
## LCC -7.320e-01 2.785e-01 -2.629 0.008891 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.447 on 415 degrees of freedom
## Multiple R-squared: 0.9426, Adjusted R-squared: 0.9407
## F-statistic: 486.7 on 14 and 415 DF, p-value: < 2.2e-16
We now try to see whether we can improve our first model. Based on the t values and Pr values obtained in the first model, we keep the variables with a Pr value less than 0.5. Even though T2 has a Pr value greater than 0.5, we keep T2 in the second model as well, because we believe that T2 and D2 are related. As we saw in the quantile plot of the winter data, the ECMWF model T2 deviates widely from the observed T2Obs, which may explain the high Pr value for T2. This results in a new formula for the linear regression:
** D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC **
The summary of the model shows that the F-statistic has increased from 486.7 to 569.9. The p-value remains the same low value, and the residual standard error is almost unchanged (1.447 for the first model, 1.444 for the second).
my_model2_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model2_D2)
##
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 +
## SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9748 -0.7604 0.0283 0.8673 5.1340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161e+01 9.731e+00 1.193 0.233647
## T2 -5.119e-02 1.194e-01 -0.429 0.668300
## D2 5.435e-01 1.128e-01 4.820 2.02e-06 ***
## T_950 3.366e-01 9.112e-02 3.694 0.000250 ***
## T2_M1 -1.212e-01 1.056e-01 -1.148 0.251666
## T_950_M1 -1.248e-01 8.511e-02 -1.467 0.143227
## D2_M1 3.153e-01 9.758e-02 3.231 0.001330 **
## SKT 7.880e-02 5.831e-02 1.352 0.177268
## MSL -7.151e-05 6.100e-05 -1.172 0.241729
## TP -7.800e-01 4.771e-01 -1.635 0.102818
## RH_950 3.166e-02 7.157e-03 4.424 1.24e-05 ***
## V10 8.464e-02 2.286e-02 3.703 0.000242 ***
## LCC -7.352e-01 2.775e-01 -2.649 0.008372 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.444 on 417 degrees of freedom
## Multiple R-squared: 0.9425, Adjusted R-squared: 0.9409
## F-statistic: 569.9 on 12 and 417 DF, p-value: < 2.2e-16
We try to improve the model further. Based on the t values and Pr values obtained in our second model, we keep the variables with a Pr value of around 0.1 or below. Even though T2 has a Pr value greater than 0.5, we keep T2 in the third model as well, because we believe that T2 and D2 are related. This results in a new formula for the linear regression:
** D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC **
The summary of the model shows that the F-statistic has increased from 569.9 to 756.7. The p-value remains the same low value, and the residual standard error is almost unchanged (1.446 for the third model, 1.444 for the second).
my_model3_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model3_D2)
##
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP +
## RH_950 + V10 + LCC, data = D2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0105 -0.7589 -0.0108 0.8660 5.1765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.472731 4.564740 0.542 0.588311
## T2 -0.063511 0.055724 -1.140 0.255043
## D2 0.564682 0.097071 5.817 1.19e-08 ***
## T_950 0.364282 0.087990 4.140 4.20e-05 ***
## T_950_M1 -0.148566 0.082889 -1.792 0.073798 .
## D2_M1 0.265225 0.070523 3.761 0.000193 ***
## TP -0.603928 0.455154 -1.327 0.185274
## RH_950 0.035618 0.006496 5.483 7.22e-08 ***
## V10 0.082966 0.022751 3.647 0.000299 ***
## LCC -0.629401 0.263309 -2.390 0.017272 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.446 on 420 degrees of freedom
## Multiple R-squared: 0.9419, Adjusted R-squared: 0.9407
## F-statistic: 756.7 on 9 and 420 DF, p-value: < 2.2e-16
From the above analysis we see that we do not gain much in terms of residual error or p-value when we move from model 1 to model 3. As model 3 has nine variables with quite good Pr values, we select this third model, which also has a higher F value than the other two models.
The linear regression formula for our model is taken as
** D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC **
my.model.D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my.model.D2)
##
## Call:
## lm(formula = D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP +
## RH_950 + V10 + LCC, data = D2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0105 -0.7589 -0.0108 0.8660 5.1765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.472731 4.564740 0.542 0.588311
## T2 -0.063511 0.055724 -1.140 0.255043
## D2 0.564682 0.097071 5.817 1.19e-08 ***
## T_950 0.364282 0.087990 4.140 4.20e-05 ***
## T_950_M1 -0.148566 0.082889 -1.792 0.073798 .
## D2_M1 0.265225 0.070523 3.761 0.000193 ***
## TP -0.603928 0.455154 -1.327 0.185274
## RH_950 0.035618 0.006496 5.483 7.22e-08 ***
## V10 0.082966 0.022751 3.647 0.000299 ***
## LCC -0.629401 0.263309 -2.390 0.017272 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.446 on 420 degrees of freedom
## Multiple R-squared: 0.9419, Adjusted R-squared: 0.9407
## F-statistic: 756.7 on 9 and 420 DF, p-value: < 2.2e-16
The fitted regression equation for the predicted D2 is: D2 = 2.472731 - 0.063511 * T2 (of the ECMWF model) + 0.564682 * D2 (of the ECMWF model) + 0.364282 * T_950 - 0.148566 * T_950_M1 + 0.265225 * D2_M1 - 0.603928 * TP + 0.035618 * RH_950 + 0.082966 * V10 - 0.629401 * LCC
# calculating the fit
fitted.values <- predict(object = my.model.D2, newdata = D2_winter_00_08_test)
D2pred.mymodel <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- my.model.D2$coefficients
rmse.mymodel.D2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) T2 D2 T_950 T_950_M1 D2_M1
## 2.47273050 -0.06351071 0.56468186 0.36428231 -0.14856581 0.26522469
## TP RH_950 V10 LCC
## -0.60392832 0.03561787 0.08296641 -0.62940091
rmse.mymodel.D2
## [1] 1.351384
In order to analyze my.model, we compare it with models produced by three other linear regression approaches. The first, the “lm model”, uses the R function lm with all the variables of the dataset.
The second is the “lmstep” model, which uses the R “step” algorithm. This algorithm develops a sequence of linear models, removing or adding a variable at each step. The model is updated when the candidate has a lower Akaike Information Criterion (AIC) than the current model. AIC is defined as $AIC = N \log\left(\frac{RSS}{N}\right) + 2k$, where N is the number of observations, RSS is the residual sum of squares and k is the number of fitted parameters (degrees of freedom) of the model.
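To make the selection criterion concrete, here is a small Python sketch (with hypothetical RSS values, not numbers from this report) of how two candidate models are compared by AIC:

```python
import math

def aic(rss, n, k):
    """AIC = N * log(RSS / N) + 2k, the form used for least-squares models."""
    return n * math.log(rss / n) + 2 * k

n = 430  # hypothetical number of training observations
# Dropping three weak predictors raises RSS only slightly...
smaller_model = aic(rss=855.0, n=n, k=9)
larger_model = aic(rss=850.0, n=n, k=12)
# ...so the smaller model wins: its lower penalty 2k outweighs the RSS increase.
print(smaller_model < larger_model)
```

The stepwise search keeps whichever candidate has the lower AIC, so it trades goodness of fit (RSS) against model size (k).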
The third method is glmcv, a Gaussian linear regression model with cross validation. The training dataset is divided into K equal parts and one part is set aside as validation data. The idea behind K-fold cross validation is that the linear regression model is trained on K-1 parts and validated (by prediction) on the part that was set aside, for which the prediction error is calculated. In the next iteration the previously set-aside part is returned to the training data and a new part is set aside. The iterations continue in this rotating fashion, each time using K-1 parts for training and one part for validation. The prediction error is calculated for each iteration, and glmcv selects the model from the iteration with the smallest error.
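The fold rotation described above can be sketched in a few lines of Python (synthetic straight-line data and a simple least-squares fit; this only illustrates the K-fold bookkeeping, not boot::cv.glm itself):

```python
import random

def kfold_mse(xs, ys, k=4):
    """Average held-out prediction MSE over k rotating folds."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal parts
    errors = []
    for held_out in folds:
        train = [i for i in idx if i not in held_out]
        # fit y = a + b*x by ordinary least squares on the k-1 training parts
        n = len(train)
        mx = sum(xs[i] for i in train) / n
        my = sum(ys[i] for i in train) / n
        b = (sum((xs[i] - mx) * (ys[i] - my) for i in train)
             / sum((xs[i] - mx) ** 2 for i in train))
        a = my - b * mx
        # prediction error on the set-aside part
        errors.append(sum((ys[i] - (a + b * xs[i])) ** 2
                          for i in held_out) / len(held_out))
    return sum(errors) / k

random.seed(1)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + random.uniform(-0.5, 0.5) for x in xs]
print(kfold_mse(xs, ys) < 0.2)  # held-out error stays near the noise level
```

Each observation is used exactly once for validation and K-1 times for training, so the averaged error estimates how the model would predict unseen data.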
We then use the lm, lmstep and glmcv models to predict T2 and D2 on the test data, and use the root mean square error (RMSE) to compare the different models.
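For reference, the RMSE used throughout the comparison is simply the square root of the mean squared prediction error; a minimal Python sketch (with toy observation and prediction vectors) is:

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Toy example: one prediction is off by 2 degrees, the rest are exact.
print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))
```

This mirrors the R expression sqrt(mean(residuals^2)) used below for each model.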
For the lm model we use all the model variables. With this model we get a residual standard error of 1.413, an F-statistic of 228.9 and a p-value < 2.2e-16.
lm.model.T2 <- lm(T2Obs ~ ., data = T2_winter_00_08_train)
summary(lm.model.T2)
##
## Call:
## lm(formula = T2Obs ~ ., data = T2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2034 -0.6131 0.1082 0.7946 4.4716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.788e+01 1.111e+01 1.609 0.108370
## MSL -4.625e-05 7.012e-05 -0.660 0.509912
## T2 1.954e-01 1.218e-01 1.604 0.109479
## D2 4.390e-01 1.151e-01 3.815 0.000158 ***
## U10 1.569e-02 2.537e-02 0.618 0.536789
## V10 6.873e-02 2.387e-02 2.880 0.004192 **
## TP -1.122e+00 5.206e-01 -2.155 0.031723 *
## LCC -1.438e+00 2.996e-01 -4.800 2.24e-06 ***
## MCC -9.167e-02 2.640e-01 -0.347 0.728581
## HCC 1.279e-02 2.031e-01 0.063 0.949846
## SKT -1.651e-02 5.941e-02 -0.278 0.781277
## RH_950 1.252e-02 1.336e-02 0.938 0.349002
## T_950 2.053e-01 1.535e-01 1.338 0.181739
## RH_925 -1.790e-02 1.235e-02 -1.449 0.148033
## T_925 -3.156e-01 1.474e-01 -2.141 0.032904 *
## RH_850 1.170e-02 5.390e-03 2.171 0.030530 *
## T_850 3.434e-02 6.016e-02 0.571 0.568491
## RH_700 -4.971e-04 3.395e-03 -0.146 0.883646
## T_700 1.454e-02 4.713e-02 0.308 0.757918
## RH_500 2.578e-04 2.995e-03 0.086 0.931455
## T_500 2.594e-02 3.145e-02 0.825 0.409956
## T2_M1 3.035e-01 1.072e-01 2.830 0.004885 **
## T_950_M1 1.147e-01 1.196e-01 0.959 0.338050
## T_925_M1 8.026e-02 1.102e-01 0.728 0.466943
## DECLINATION 1.130e-02 1.592e-02 0.710 0.478269
## D2_M1 -1.190e-01 9.902e-02 -1.201 0.230269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.413 on 404 degrees of freedom
## Multiple R-squared: 0.9341, Adjusted R-squared: 0.93
## F-statistic: 228.9 on 25 and 404 DF, p-value: < 2.2e-16
We now assess the predictive ability of lm.model.T2 by using it to predict T2 for the test data T2_winter_00_08_test. The results show that the root mean square error (RMSE) is 1.411888.
# calculating the fit
fitted.values <- predict(object = lm.model.T2, newdata = T2_winter_00_08_test)
T2pred.lm <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- lm.model.T2$coefficients
rmse.lm.T2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) MSL T2 D2 U10
## 1.787541e+01 -4.624988e-05 1.954199e-01 4.390128e-01 1.568595e-02
## V10 TP LCC MCC HCC
## 6.872884e-02 -1.122150e+00 -1.437913e+00 -9.167406e-02 1.278564e-02
## SKT RH_950 T_950 RH_925 T_925
## -1.650709e-02 1.252312e-02 2.053390e-01 -1.789853e-02 -3.156258e-01
## RH_850 T_850 RH_700 T_700 RH_500
## 1.170128e-02 3.433758e-02 -4.971310e-04 1.453740e-02 2.577569e-04
## T_500 T2_M1 T_950_M1 T_925_M1 DECLINATION
## 2.594027e-02 3.035225e-01 1.147081e-01 8.025665e-02 1.130134e-02
## D2_M1
## -1.189730e-01
rmse.lm.T2
## [1] 1.411888
For the lm model for D2 we again use all the model variables. With this model we get a residual standard error of 1.438, an F-statistic of 276.2 and a p-value < 2.2e-16.
lm.model.D2 <- lm(D2Obs ~ ., data = D2_winter_00_08_train)
summary(lm.model.D2)
##
## Call:
## lm(formula = D2Obs ~ ., data = D2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1028 -0.7146 -0.0346 0.8332 5.1312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.370e+00 1.131e+01 -0.121 0.903687
## MSL -4.015e-05 7.141e-05 -0.562 0.574226
## T2 -5.310e-02 1.241e-01 -0.428 0.668872
## D2 5.394e-01 1.172e-01 4.603 5.59e-06 ***
## U10 3.411e-02 2.584e-02 1.320 0.187504
## V10 6.797e-02 2.430e-02 2.797 0.005406 **
## TP -5.540e-01 5.302e-01 -1.045 0.296657
## LCC -8.296e-01 3.051e-01 -2.719 0.006821 **
## MCC -3.327e-02 2.688e-01 -0.124 0.901581
## HCC -8.632e-02 2.069e-01 -0.417 0.676689
## SKT 9.765e-02 6.050e-02 1.614 0.107305
## RH_950 4.590e-02 1.360e-02 3.374 0.000811 ***
## T_950 4.304e-01 1.563e-01 2.754 0.006157 **
## RH_925 -1.996e-02 1.258e-02 -1.587 0.113247
## T_925 -1.621e-01 1.501e-01 -1.080 0.280926
## RH_850 9.344e-03 5.489e-03 1.702 0.089452 .
## T_850 3.326e-02 6.126e-02 0.543 0.587477
## RH_700 -1.310e-03 3.457e-03 -0.379 0.704838
## T_700 -9.826e-03 4.800e-02 -0.205 0.837896
## RH_500 7.553e-04 3.050e-03 0.248 0.804524
## T_500 2.791e-02 3.202e-02 0.872 0.383929
## T2_M1 -8.806e-02 1.092e-01 -0.806 0.420497
## T_950_M1 -1.349e-01 1.218e-01 -1.108 0.268599
## T_925_M1 4.932e-02 1.122e-01 0.439 0.660583
## DECLINATION -4.616e-02 1.621e-02 -2.847 0.004644 **
## D2_M1 2.821e-01 1.008e-01 2.797 0.005400 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.438 on 404 degrees of freedom
## Multiple R-squared: 0.9447, Adjusted R-squared: 0.9413
## F-statistic: 276.2 on 25 and 404 DF, p-value: < 2.2e-16
We now assess the predictive ability of lm.model.D2 by using it to predict D2 for the test data D2_winter_00_08_test. The results show that the RMSE is 1.305783.
# calculating the fit
fitted.values <- predict(object = lm.model.D2, newdata = D2_winter_00_08_test)
D2pred.lm <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- lm.model.D2$coefficients
rmse.lm.D2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) MSL T2 D2 U10
## -1.369675e+00 -4.015087e-05 -5.309726e-02 5.394137e-01 3.411170e-02
## V10 TP LCC MCC HCC
## 6.797341e-02 -5.540012e-01 -8.295798e-01 -3.326516e-02 -8.632013e-02
## SKT RH_950 T_950 RH_925 T_925
## 9.764619e-02 4.589526e-02 4.304398e-01 -1.996047e-02 -1.621084e-01
## RH_850 T_850 RH_700 T_700 RH_500
## 9.344400e-03 3.326178e-02 -1.310375e-03 -9.825790e-03 7.552656e-04
## T_500 T2_M1 T_950_M1 T_925_M1 DECLINATION
## 2.791360e-02 -8.806281e-02 -1.349101e-01 4.932023e-02 -4.615555e-02
## D2_M1
## 2.820616e-01
rmse.lm.D2
## [1] 1.305783
As explained earlier, this algorithm develops a sequence of linear models, removing or adding a variable at each step. The step model starts from an initial formula, object.formula. We can also define lower and upper formulas, and the algorithm can step backward or forward from the initial formula towards the lower or upper formula, up to the maximum number of steps given in the steps variable. Here we define object.formula as the intercept-only model, i.e., T2Obs ~ 1. The lower.formula is the same as object.formula and the upper.formula contains all the variables, i.e., T2Obs ~ . We use only the forward direction.
With this model we get a residual standard error of 1.397, an F-statistic of 531.2 and a p-value < 2.2e-16.
# generating the linear models
response <- colnames(T2_winter_00_08_train)[1]
object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
object.lm <- lm(object.formula, data = T2_winter_00_08_train)
upper.lm <- lm(upper.formula, data = T2_winter_00_08_train)
lower.lm <- lm(lower.formula, data = T2_winter_00_08_train)
direction <- "forward"
steps <- 1000
# choosing the best model
step.model.T2 <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
step.model.T2
##
## Call:
## lm(formula = T2Obs ~ T2 + T_950_M1 + LCC + D2 + V10 + T2_M1 +
## TP + RH_850 + T_500 + T_925 + D2_M1, data = T2_winter_00_08_train)
##
## Coefficients:
## (Intercept) T2 T_950_M1 LCC D2
## 9.823552 0.194058 0.209485 -1.606888 0.462476
## V10 T2_M1 TP RH_850 T_500
## 0.071111 0.296620 -1.175379 0.006691 0.034410
## T_925 D2_M1
## -0.084216 -0.138672
summary(step.model.T2)
##
## Call:
## lm(formula = T2Obs ~ T2 + T_950_M1 + LCC + D2 + V10 + T2_M1 +
## TP + RH_850 + T_500 + T_925 + D2_M1, data = T2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1462 -0.6120 0.0862 0.8179 4.5285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.823552 4.529546 2.169 0.03066 *
## T2 0.194058 0.101660 1.909 0.05696 .
## T_950_M1 0.209485 0.058579 3.576 0.00039 ***
## LCC -1.606888 0.254824 -6.306 7.30e-10 ***
## D2 0.462476 0.098471 4.697 3.59e-06 ***
## V10 0.071111 0.022136 3.212 0.00142 **
## T2_M1 0.296620 0.100797 2.943 0.00343 **
## TP -1.175379 0.463145 -2.538 0.01152 *
## RH_850 0.006691 0.003119 2.145 0.03250 *
## T_500 0.034410 0.017177 2.003 0.04580 *
## T_925 -0.084216 0.054377 -1.549 0.12220
## D2_M1 -0.138672 0.092717 -1.496 0.13550
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 418 degrees of freedom
## Multiple R-squared: 0.9332, Adjusted R-squared: 0.9315
## F-statistic: 531.2 on 11 and 418 DF, p-value: < 2.2e-16
We now assess the predictive ability of step.model.T2 by using it to predict T2 for the test data T2_winter_00_08_test. The results show that the RMSE is 1.422109.
# calculating the fit
fitted.values <- predict(object = step.model.T2, newdata = T2_winter_00_08_test)
T2pred.lmstep <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- step.model.T2$coefficients
rmse.lmstep.T2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) T2 T_950_M1 LCC D2
## 9.823552427 0.194057743 0.209484845 -1.606888038 0.462475765
## V10 T2_M1 TP RH_850 T_500
## 0.071110983 0.296620115 -1.175378598 0.006691133 0.034409528
## T_925 D2_M1
## -0.084215592 -0.138671532
rmse.lmstep.T2
## [1] 1.422109
Similar to the T2 model, we define object.formula as the intercept-only model, i.e., D2Obs ~ 1. The lower.formula is the same as object.formula and the upper.formula contains all the variables, i.e., D2Obs ~ . We use only the forward direction.
With this model we get a residual standard error of 1.428, an F-statistic of 777.9 and a p-value < 2.2e-16. Notice that T2 is not present in the final formula, which gave the best AIC.
# generating the linear models
response <- colnames(D2_winter_00_08_train)[1]
object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
object.lm <- lm(object.formula, data = D2_winter_00_08_train)
upper.lm <- lm(upper.formula, data = D2_winter_00_08_train)
lower.lm <- lm(lower.formula, data = D2_winter_00_08_train)
direction <- "forward"
steps <- 1000
# choosing the best model
step.model.D2 <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
summary(step.model.D2)
##
## Call:
## lm(formula = D2Obs ~ D2 + U10 + DECLINATION + T_950 + RH_950 +
## LCC + V10 + D2_M1 + T_950_M1, data = D2_winter_00_08_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0791 -0.6686 -0.0301 0.8325 5.2744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.239479 5.164038 -0.434 0.664752
## D2 0.561928 0.082665 6.798 3.66e-11 ***
## U10 0.043221 0.023492 1.840 0.066496 .
## DECLINATION -0.042510 0.014883 -2.856 0.004500 **
## T_950 0.333876 0.086727 3.850 0.000137 ***
## RH_950 0.031571 0.006411 4.924 1.22e-06 ***
## LCC -0.632813 0.267905 -2.362 0.018628 *
## V10 0.062783 0.022688 2.767 0.005903 **
## D2_M1 0.244156 0.069842 3.496 0.000523 ***
## T_950_M1 -0.142431 0.081859 -1.740 0.082600 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.428 on 420 degrees of freedom
## Multiple R-squared: 0.9434, Adjusted R-squared: 0.9422
## F-statistic: 777.9 on 9 and 420 DF, p-value: < 2.2e-16
We now assess the predictive ability of step.model.D2 by using it to predict D2 for the test data D2_winter_00_08_test. The results show that the RMSE is 1.334799.
# calculating the fit
fitted.values <- predict(object = step.model.D2, newdata = D2_winter_00_08_test)
D2pred.lmstep <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- step.model.D2$coefficients
rmse.lmstep.D2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) D2 U10 DECLINATION T_950 RH_950
## -2.23947895 0.56192768 0.04322126 -0.04250985 0.33387578 0.03157108
## LCC V10 D2_M1 T_950_M1
## -0.63281298 0.06278252 0.24415608 -0.14243140
rmse.lmstep.D2
## [1] 1.334799
We use the glm function together with the cv.glm cross-validation routine from the boot package. With K = 4 folds, each fit trains on roughly 300 observations and validates on roughly 100. We repeat the 4-fold cross validation ten times; only the random assignment of observations to folds changes between iterations, and the cross validation error is smallest for the first iteration.
library(boot)
set.seed(123)
cv.error.10 = rep(0, 10)
glm.fit.T2 <- glm(T2Obs ~ ., data = T2_winter_00_08_train)
for (i in 1:10) {
  cv.error.10[i] <- cv.glm(T2_winter_00_08_train, glm.fit.T2, K = 4)$delta[1]
}
cv.error.10
## [1] 2.143964 2.263606 2.187350 2.202544 2.285312 2.162794 2.227936
## [8] 2.363532 2.399627 2.274742
summary(glm.fit.T2)
##
## Call:
## glm(formula = T2Obs ~ ., data = T2_winter_00_08_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.2034 -0.6131 0.1082 0.7946 4.4716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.788e+01 1.111e+01 1.609 0.108370
## MSL -4.625e-05 7.012e-05 -0.660 0.509912
## T2 1.954e-01 1.218e-01 1.604 0.109479
## D2 4.390e-01 1.151e-01 3.815 0.000158 ***
## U10 1.569e-02 2.537e-02 0.618 0.536789
## V10 6.873e-02 2.387e-02 2.880 0.004192 **
## TP -1.122e+00 5.206e-01 -2.155 0.031723 *
## LCC -1.438e+00 2.996e-01 -4.800 2.24e-06 ***
## MCC -9.167e-02 2.640e-01 -0.347 0.728581
## HCC 1.279e-02 2.031e-01 0.063 0.949846
## SKT -1.651e-02 5.941e-02 -0.278 0.781277
## RH_950 1.252e-02 1.336e-02 0.938 0.349002
## T_950 2.053e-01 1.535e-01 1.338 0.181739
## RH_925 -1.790e-02 1.235e-02 -1.449 0.148033
## T_925 -3.156e-01 1.474e-01 -2.141 0.032904 *
## RH_850 1.170e-02 5.390e-03 2.171 0.030530 *
## T_850 3.434e-02 6.016e-02 0.571 0.568491
## RH_700 -4.971e-04 3.395e-03 -0.146 0.883646
## T_700 1.454e-02 4.713e-02 0.308 0.757918
## RH_500 2.578e-04 2.995e-03 0.086 0.931455
## T_500 2.594e-02 3.145e-02 0.825 0.409956
## T2_M1 3.035e-01 1.072e-01 2.830 0.004885 **
## T_950_M1 1.147e-01 1.196e-01 0.959 0.338050
## T_925_M1 8.026e-02 1.102e-01 0.728 0.466943
## DECLINATION 1.130e-02 1.592e-02 0.710 0.478269
## D2_M1 -1.190e-01 9.902e-02 -1.201 0.230269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.99557)
##
## Null deviance: 12225.28 on 429 degrees of freedom
## Residual deviance: 806.21 on 404 degrees of freedom
## AIC: 1544.6
##
## Number of Fisher Scoring iterations: 2
We now assess the predictive ability of glm.fit.T2 by using it to predict T2 for the test data T2_winter_00_08_test. The resulting root mean square error (rmse) is 1.411888.
# calculating the fit
fitted.values <- predict(object = glm.fit.T2, newdata = T2_winter_00_08_test)
T2pred.glmcv <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- glm.fit.T2$coefficients
rmse.glmcv.T2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) MSL T2 D2 U10
## 1.787541e+01 -4.624988e-05 1.954199e-01 4.390128e-01 1.568595e-02
## V10 TP LCC MCC HCC
## 6.872884e-02 -1.122150e+00 -1.437913e+00 -9.167406e-02 1.278564e-02
## SKT RH_950 T_950 RH_925 T_925
## -1.650709e-02 1.252312e-02 2.053390e-01 -1.789853e-02 -3.156258e-01
## RH_850 T_850 RH_700 T_700 RH_500
## 1.170128e-02 3.433758e-02 -4.971310e-04 1.453740e-02 2.577569e-04
## T_500 T2_M1 T_950_M1 T_925_M1 DECLINATION
## 2.594027e-02 3.035225e-01 1.147081e-01 8.025665e-02 1.130134e-02
## D2_M1
## -1.189730e-01
rmse.glmcv.T2
## [1] 1.411888
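For reference, cv.glm reports its error estimate in the delta component: the first element is the raw K-fold cross-validation estimate of the mean squared prediction error, the second a bias-corrected version. A minimal, self-contained illustration on the built-in mtcars data (the variable names here are ours, not part of the assignment data):

```r
library(boot)

set.seed(123)
# A Gaussian glm (equivalent to lm) for fuel consumption
fit <- glm(mpg ~ wt + hp, data = mtcars)
# 4-fold cross validation, as for the T2 model above
cv.result <- cv.glm(mtcars, fit, K = 4)
# delta[1]: raw CV estimate of the mean squared prediction error
# delta[2]: bias-corrected estimate
cv.result$delta
```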
Similar to the T2 modelling with glm, we carry out the modelling for D2, this time with K = 5 folds.
library(boot)
set.seed(123)
cv.error.10 = rep(0, 10)
glm.fit.D2 <- glm(D2Obs ~ ., data = D2_winter_00_08_train)
for (i in 1:10) {
  cv.error.10[i] <- cv.glm(D2_winter_00_08_train, glm.fit.D2, K = 5)$delta[1]
}
cv.error.10
## [1] 2.232088 2.253692 2.297469 2.343139 2.303318 2.279148 2.321005
## [8] 2.407905 2.353452 2.230696
summary(glm.fit.D2)
##
## Call:
## glm(formula = D2Obs ~ ., data = D2_winter_00_08_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1028 -0.7146 -0.0346 0.8332 5.1312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.370e+00 1.131e+01 -0.121 0.903687
## MSL -4.015e-05 7.141e-05 -0.562 0.574226
## T2 -5.310e-02 1.241e-01 -0.428 0.668872
## D2 5.394e-01 1.172e-01 4.603 5.59e-06 ***
## U10 3.411e-02 2.584e-02 1.320 0.187504
## V10 6.797e-02 2.430e-02 2.797 0.005406 **
## TP -5.540e-01 5.302e-01 -1.045 0.296657
## LCC -8.296e-01 3.051e-01 -2.719 0.006821 **
## MCC -3.327e-02 2.688e-01 -0.124 0.901581
## HCC -8.632e-02 2.069e-01 -0.417 0.676689
## SKT 9.765e-02 6.050e-02 1.614 0.107305
## RH_950 4.590e-02 1.360e-02 3.374 0.000811 ***
## T_950 4.304e-01 1.563e-01 2.754 0.006157 **
## RH_925 -1.996e-02 1.258e-02 -1.587 0.113247
## T_925 -1.621e-01 1.501e-01 -1.080 0.280926
## RH_850 9.344e-03 5.489e-03 1.702 0.089452 .
## T_850 3.326e-02 6.126e-02 0.543 0.587477
## RH_700 -1.310e-03 3.457e-03 -0.379 0.704838
## T_700 -9.826e-03 4.800e-02 -0.205 0.837896
## RH_500 7.553e-04 3.050e-03 0.248 0.804524
## T_500 2.791e-02 3.202e-02 0.872 0.383929
## T2_M1 -8.806e-02 1.092e-01 -0.806 0.420497
## T_950_M1 -1.349e-01 1.218e-01 -1.108 0.268599
## T_925_M1 4.932e-02 1.122e-01 0.439 0.660583
## DECLINATION -4.616e-02 1.621e-02 -2.847 0.004644 **
## D2_M1 2.821e-01 1.008e-01 2.797 0.005400 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.069254)
##
## Null deviance: 15126.62 on 429 degrees of freedom
## Residual deviance: 835.98 on 404 degrees of freedom
## AIC: 1560.2
##
## Number of Fisher Scoring iterations: 2
We now assess the predictive ability of glm.fit.D2 by using it to predict D2 for the test data D2_winter_00_08_test. The resulting root mean square error (rmse) is 1.305783.
# calculating the fit
fitted.values <- predict(object = glm.fit.D2, newdata = D2_winter_00_08_test)
D2pred.glmcv <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- glm.fit.D2$coefficients
rmse.glmcv.D2 <- sqrt(mean(residuals^2))
coefficients
## (Intercept) MSL T2 D2 U10
## -1.369675e+00 -4.015087e-05 -5.309726e-02 5.394137e-01 3.411170e-02
## V10 TP LCC MCC HCC
## 6.797341e-02 -5.540012e-01 -8.295798e-01 -3.326516e-02 -8.632013e-02
## SKT RH_950 T_950 RH_925 T_925
## 9.764619e-02 4.589526e-02 4.304398e-01 -1.996047e-02 -1.621084e-01
## RH_850 T_850 RH_700 T_700 RH_500
## 9.344400e-03 3.326178e-02 -1.310375e-03 -9.825790e-03 7.552656e-04
## T_500 T2_M1 T_950_M1 T_925_M1 DECLINATION
## 2.791360e-02 -8.806281e-02 -1.349101e-01 4.932023e-02 -4.615555e-02
## D2_M1
## 2.820616e-01
rmse.glmcv.D2
## [1] 1.305783
We can see that my.model for T2 and D2 has rmse values comparable to those obtained using lm, lmstep and glmcv.
cat("rmse mymodel T2=", rmse.mymodel.T2, "rmse lm T2=", rmse.lm.T2, "rmse lmstep T2=", rmse.lmstep.T2, "rmse glmcv T2=", rmse.glmcv.T2,
"rmse mymodel D2=", rmse.mymodel.D2, "rmse lm D2=", rmse.lm.D2,"rmse lmstep D2=", rmse.lmstep.D2, "rmse glmcv D2=", rmse.glmcv.D2
)
## rmse mymodel T2= 1.462304 rmse lm T2= 1.411888 rmse lmstep T2= 1.422109 rmse glmcv T2= 1.411888 rmse mymodel D2= 1.351384 rmse lm D2= 1.305783 rmse lmstep D2= 1.334799 rmse glmcv D2= 1.305783
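The cat output above is easier to compare when collected into a data frame; a small sketch using the rmse values reported above (the object name rmse.summary is ours):

```r
# rmse values for T2 and D2 from the four linear regression models above
rmse.summary <- data.frame(
  model = c("mymodel", "lm", "lmstep", "glmcv"),
  T2 = c(1.462304, 1.411888, 1.422109, 1.411888),
  D2 = c(1.351384, 1.305783, 1.334799, 1.305783)
)
# sort by the T2 rmse to rank the models
rmse.summary[order(rmse.summary$T2), ]
```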
We plot the following graphs for all the models: Residuals vs Fitted values, Normal QQ-plot, Standardized residuals vs Fitted values and Residuals vs Leverage.
The Residuals vs Fitted values plot shows whether the residuals have non-linear patterns. If there is a non-linear relationship between the predictor variables and the response variable that the model fails to capture, the pattern shows up in this plot. Ideally there should be no pattern: the residuals should be spread randomly around a horizontal line drawn at 0 on the y-axis. The red line in the graph highlights the trend in the pattern and its deviation from the horizontal line at 0.
The second graph, the normal QQ-plot shows if the residuals are normally distributed. If the residuals follow the straight dashed line, the residuals are normally distributed.
The third graph, Standardized residuals vs Fitted values, shows whether the residuals are spread equally across the range of fitted values, i.e. whether their variance is homogeneous. The fourth graph, Residuals vs Leverage, helps to find influential extreme values or outliers. The data have extreme values, but we can consider them non-influential if the fitted linear regression model barely changes whether we include or exclude them from the analysis. The red dashed line in the graph shows Cook's distance. If all the points in the leverage plot are well within Cook's distance, there are no outliers that influence the linear regression.
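The influence measure behind the leverage plot can also be inspected numerically with cooks.distance(); a minimal sketch on simulated data (the 4/n threshold used here is a common rule of thumb, not the 0.5 and 1 contours drawn in the plot):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[1] <- 20                      # plant one influential outlier
toy.lm <- lm(y ~ x)
cooks <- cooks.distance(toy.lm)
# points whose Cook's distance exceeds the 4/n rule of thumb
which(cooks > 4 / n)
```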
We plot the diagnostic plots for our linear regression models mymodel, lm, lmstep and glmcv. For mymodel, the Residuals vs Fitted values plot shows most of the residuals scattered around zero, the normal QQ-plot is close to linear, and the leverage plot shows no influential outliers. These graphs indicate that mymodel fits the weather data well. The corresponding graphs for lm, lmstep and glmcv show that they also produce good linear regression models for the weather data.
# We plot the graphs Residuals vs Fitted values, Normal QQ-plot, Standardized residuals
# vs Fitted values and Residuals vs Leverage. Note that plot.lm() selects captions by
# plot number, so the caption for plot 5 (Residuals vs Leverage) must sit in position 5.
par(mfrow = c(2,2))
plot(my.model.T2, which = c(1,2,3,5), caption = list("T2-mymodel Residuals vs Fitted values", "T2-mymodel Normal QQ-plot", "T2-mymodel Standardized residuals vs Fitted values", "", "T2-mymodel Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(my.model.D2, which = c(1,2,5), caption = list("D2-mymodel Residuals vs Fitted values", "D2-mymodel Normal QQ-plot", "", "", "D2-mymodel Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(lm.model.T2, which = c(1,2,5), caption = list("T2-lm Residuals vs Fitted values", "T2-lm Normal QQ-plot", "", "", "T2-lm Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(lm.model.D2, which = c(1,2,5), caption = list("D2-lm Residuals vs Fitted values", "D2-lm Normal QQ-plot", "", "", "D2-lm Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(step.model.T2, which = c(1,2,5), caption = list("T2-lmstep Residuals vs Fitted values", "T2-lmstep Normal QQ-plot", "", "", "T2-lmstep Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(step.model.D2, which = c(1,2,5), caption = list("D2-lmstep Residuals vs Fitted values", "D2-lmstep Normal QQ-plot", "", "", "D2-lmstep Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(glm.fit.T2, which = c(1,2,5), caption = list("T2-glm Residuals vs Fitted values", "T2-glm Normal QQ-plot", "", "", "T2-glm Residuals vs Leverage"))
par(mfrow = c(2,2))
plot(glm.fit.D2, which = c(1,2,5), caption = list("D2-glm Residuals vs Fitted values", "D2-glm Normal QQ-plot", "", "", "D2-glm Residuals vs Leverage"))
## RH predictions
The aim of this assignment is to study the interdependence of the variables T2 and D2. We do this by calculating the relative humidity (RH), which is a function of T2 and D2.
In this section, we calculate the RH values using T2Obs and D2Obs, using the T2 and D2 from the ECMWF model, and using the T2 and D2 obtained from the linear regression models mymodel, lm, lmstep and glmcv.
T2Obs.test <- T2_winter_00_08_test[,1]
D2Obs.test <- D2_winter_00_08_test[,1]
T2model.test <- T2_winter_00_08_test[,3]
D2model.test <- D2_winter_00_08_test[,4]
# RH calculation
# RH from T2Obs and D2Obs
saturation_humidity_obs <- 6.112*exp((17.62*T2Obs.test)/(T2Obs.test + 243.12))
specific_humidity_obs <- 6.112*exp((17.62*D2Obs.test)/(D2Obs.test + 243.12))
RHObs <- 100*(specific_humidity_obs/saturation_humidity_obs)
# RH from T2 and D2 from ECMWF model
saturation_humidity_model <- 6.112*exp((17.62*T2model.test)/(T2model.test + 243.12))
specific_humidity_model <- 6.112*exp((17.62*D2model.test)/(D2model.test + 243.12))
RHmodel <- 100*(specific_humidity_model/saturation_humidity_model)
# T2 and D2 predicted using mymodel
saturation_humidity_pred_mymodel <- 6.112*exp((17.62*T2pred.mymodel)/(T2pred.mymodel + 243.12))
specific_humidity_pred_mymodel <- 6.112*exp((17.62*D2pred.mymodel)/(D2pred.mymodel + 243.12))
RHpred.mymodel <- 100*(specific_humidity_pred_mymodel/saturation_humidity_pred_mymodel)
# T2 and D2 predicted using lm
saturation_humidity_pred_lm <- 6.112*exp((17.62*T2pred.lm)/(T2pred.lm + 243.12))
specific_humidity_pred_lm <- 6.112*exp((17.62*D2pred.lm)/(D2pred.lm + 243.12))
RHpred.lm <- 100*(specific_humidity_pred_lm/saturation_humidity_pred_lm)
# T2 and D2 predicted using lmstep
saturation_humidity_pred_lmstep <- 6.112*exp((17.62*T2pred.lmstep)/(T2pred.lmstep + 243.12))
specific_humidity_pred_lmstep <- 6.112*exp((17.62*D2pred.lmstep)/(D2pred.lmstep + 243.12))
RHpred.lmstep <- 100*(specific_humidity_pred_lmstep/saturation_humidity_pred_lmstep)
# T2 and D2 predicted using glm cross validation
saturation_humidity_pred_glmcv <- 6.112*exp((17.62*T2pred.glmcv)/(T2pred.glmcv + 243.12))
specific_humidity_pred_glmcv <- 6.112*exp((17.62*D2pred.glmcv)/(D2pred.glmcv + 243.12))
RHpred.glmcv <- 100*(specific_humidity_pred_glmcv/saturation_humidity_pred_glmcv)
We get the following rmse values for the different models: rmse RH ECMWF model = 4.823471, rmse RH mymodel = 2.000381, rmse RH lm = 1.920225, rmse RH lmstep = 1.980861, rmse RH glmcv = 1.983662.
We can see that the ECMWF model has a high rmse in RH compared to all the other models. The model we developed, mymodel, has RH rmse values comparable to those of the lm, lmstep and glmcv models. This experiment shows that statistical postprocessing improves the forecast, as it reduces the rmse compared to the raw model forecast.
Recall that the rmse values for T2 and D2 from the linear regressions were around 1.4 for all the models. The rmse for RH is much higher than (even double) that of T2 and D2.
rmse.RH.model <- sqrt(mean((RHmodel - RHObs)^2))
rmse.RH.mymodel <- sqrt(mean((RHpred.mymodel - RHObs)^2))
rmse.RH.lm <- sqrt(mean((RHpred.lm - RHObs)^2))
rmse.RH.lmstep <- sqrt(mean((RHpred.lmstep - RHObs)^2))
rmse.RH.glmcv <- sqrt(mean((RHpred.glmcv - RHObs)^2))
cat("rmse RH ECMWF model = ", rmse.RH.model, "rmse RH mymodel = ", rmse.RH.mymodel, "rmse RH lm = ", rmse.RH.lm, "rmse RH lmstep = ", rmse.RH.lmstep, "rmse RH glmcv = ", rmse.RH.glmcv)
## rmse RH ECMWF model = 4.823471 rmse RH mymodel = 2.000381 rmse RH lm = 1.920225 rmse RH lmstep = 1.980861 rmse RH glmcv = 1.983662
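The larger RH errors follow directly from the saturation vapour pressure formula used above: RH depends exponentially on both temperatures, so a temperature error of the size of the T2 rmse already shifts RH by several percentage points. A small numeric sketch (the example temperatures are arbitrary, and the helper name MagnusRH is ours):

```r
# RH (%) from the Magnus-type formula used throughout this report
MagnusRH <- function(t2, d2) {
  100 * exp((17.62 * d2) / (d2 + 243.12)) / exp((17.62 * t2) / (t2 + 243.12))
}
rh.exact <- MagnusRH(0, -2)        # T2 = 0 C, D2 = -2 C
rh.shifted <- MagnusRH(1.4, -2)    # T2 perturbed by its typical rmse of 1.4 C
rh.exact - rh.shifted              # a shift of several percentage points in RH
```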
This section is divided into (1) model prediction and (2) analysis and results. We fit linear regression models for the Kumpula station for the four seasons and the 64 forecast periods. For ease of analysis we use only the forecasts issued at 00:00 UTC (analysis_time is 1). We use simple linear regression (lm), linear regression with the step method (lmstep) and linear regression with cross validation (glmcv).
We model two response variables separately, namely T2Obs and D2Obs, the observations of air temperature and dewpoint temperature at 2 meters. As predictors we use the ECMWF model outputs. All these variables are explained in the section on weather data. Using the linear regression models mymodel, lm, lmstep and glmcv, we predict T2 and D2, the air temperature and dewpoint temperature, for the 64 forecast periods. We then calculate the relative humidity, RH, from the T2 and D2 predicted by the four linear regression models. We also calculate the RH using the T2 and D2 from the raw ECMWF model. The relative humidity is calculated using a standard formula.
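The standard formula in question is the Magnus approximation for saturation vapour pressure, e_s(T) = 6.112 exp(17.62 T / (T + 243.12)) hPa, with RH = 100 e_s(D2) / e_s(T2). Since the same expression recurs for every model, it can be factored into a single helper; a minimal sketch (the function names are ours, not part of the assignment code):

```r
# Magnus approximation: saturation vapour pressure (hPa) at temperature t (deg C)
SaturationVapourPressure <- function(t) {
  6.112 * exp((17.62 * t) / (t + 243.12))
}

# Relative humidity (%) from the 2 m temperature and dewpoint (both in deg C)
RelativeHumidity <- function(t2, d2) {
  100 * SaturationVapourPressure(d2) / SaturationVapourPressure(t2)
}

RelativeHumidity(10, 10)   # dewpoint equal to temperature gives 100 % RH
```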
The R program is done in a modular way with functions.
# Packages
# reshape2 package for the melt operation
library(reshape2)
# cv.glm for cross validation
library(boot)
FitWithmymodel <- function(training.set, test.set = NULL, parameter) {
# Estimates a linear model using "mymodel" we developed and predicts with the model.
#
# Args:
# training.set:
# test.set:
# parameter shows whether we want to model for T2 or D2.
# introducing the parameter is a simple way to select the formula that is to be
# used for T2 and D2
#
# Returns:
# A list of coefficients, residuals and fitted values.
response <- colnames(training.set)[1]
if (parameter == 1) {
model <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC, data = training.set )
}
else if (parameter == 2) {
model <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = training.set)
}
if (is.null(test.set)) {
test.set <- training.set
}
fitted.values <- predict(object = model, newdata = test.set)
residuals <- test.set[,1] - fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(residuals)
results <- list("coefficients" = model$coefficients,
"residuals" = residuals,
"fitted.values" = fitted.values)
class(results) <- "simplelm"
return(results)
}
FitWithLM <- function(training.set, test.set = NULL) {
# Estimates a full linear model and predicts with the model.
#
# Args:
# training.set:
# test.set:
#
# Returns:
# A list of coefficients, residuals and fitted values.
response <- colnames(training.set)[1]
formula <- paste(response, " ~ .", sep = "")
model <- lm(formula = formula, data = training.set)
if (is.null(test.set)) {
test.set <- training.set
}
fitted.values <- predict(object = model, newdata = test.set)
residuals <- test.set[,1] - fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(residuals)
results <- list("coefficients" = model$coefficients,
"residuals" = residuals,
"fitted.values" = fitted.values)
class(results) <- "simplelm"
return(results)
}
FitWithStep <- function(training.set, test.set = NULL,
object.formula, upper.formula, lower.formula,
direction, steps = 1000) {
# Chooses a linear model by a stepwise algorithm and predicts a fit on an independent test set
#
# Args:
# training.set: a data matrix
# test.set: a data matrix
# object.formula: a formula viable for a linear model
# upper.formula: a formula viable for a linear model
# lower.formula: a formula viable for a linear model
# direction: both/forward/backward
# steps: the number of steps the algorithm is allowed to take
#
# Returns:
# A list of coefficients, residuals and fitted values.
# forcing evaluation of the lazily evaluated arguments before step() re-evaluates the models
training.set <- training.set
test.set <- test.set
# generating the linear models
object.lm <- lm(object.formula, data = training.set)
upper.lm <- lm(upper.formula, data = training.set)
lower.lm <- lm(lower.formula, data = training.set)
# choosing the best model
step.model <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
# if there are no test data, the fit is calculated using the training data
if (is.null(test.set)) {
test.set <- training.set
}
# calculating the fit
fitted.values <- predict(object = step.model, newdata = test.set)
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(test.set[,1] - fitted.values)
coefficients <- step.model$coefficients
results <- list("coefficients" = coefficients,
"residuals" = residuals,
"fitted.values" = fitted.values)
class(results) <- "simplelm"
return(results)
}
FitWithglmcv <- function(training.set, test.set = NULL, k) {
# Estimates a full glm linear model with cross validation and predicts with the model.
#
# Args:
# training.set:
# test.set:
# k: the number of folds
# Returns:
# A list of coefficients, residuals and fitted values.
response <- colnames(training.set)[1]
formula <- paste(response, " ~ .", sep = "")
model <- glm(formula = formula, data = training.set)
set.seed(123)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  cv.error.10[i] <- cv.glm(training.set, model, K = k)$delta[1]
}
if (is.null(test.set)) {
test.set <- training.set
}
fitted.values <- predict(object = model, newdata = test.set)
residuals <- test.set[,1] - fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(residuals)
results <- list("coefficients" = model$coefficients,
"residuals" = residuals,
"fitted.values" = fitted.values)
class(results) <- "simplelm"
return(results)
}
ModelsOfAlgorithms <- function(training.set, test.set, parameter) {
# Models generated using different algorithms, currently with lm, lmstep and glmcv
#
#
# Args:
# training.set:
# test.set:
#
# Returns:
# A list of elements ' fitted_models', 'residuals', and 'response' variable
# model details
response <- colnames(training.set)[1]
fInitial <- as.formula(paste(response, " ~ 1", sep = ""))
fFull <- as.formula(paste(response, " ~ .", sep = ""))
# Create a list to store the model results for T2 and D2 observations
fitted.values.by.algorithm <- as.data.frame(matrix(NA, ncol = 0,
nrow=nrow(test.set)))
residuals.by.algorithm <- as.data.frame(matrix(NA, ncol = 0,
nrow = nrow(test.set)))
elapsed.time.by.algorithm <- NULL
# mymodel
tmp.time.result <- system.time(
tmp.result <- FitWithmymodel(
training.set, test.set, parameter
)
)
fitted.values.by.algorithm$mymodel <- tmp.result$fitted.values
residuals.by.algorithm$mymodel <- tmp.result$residuals
elapsed.time.by.algorithm <- c(elapsed.time.by.algorithm,
mymodel = tmp.time.result[[3]])
# lm
tmp.time.result <- system.time(
tmp.result <- FitWithLM(
training.set, test.set
)
)
fitted.values.by.algorithm$lm <- tmp.result$fitted.values
residuals.by.algorithm$lm <- tmp.result$residuals
elapsed.time.by.algorithm <- c(elapsed.time.by.algorithm,
lm = tmp.time.result[[3]])
# lm step model
tmp.time.result <- system.time(
tmp.result <- FitWithStep(
training.set,
test.set,
fInitial, fFull, fInitial,
direction = "forward", steps = 1000)
)
fitted.values.by.algorithm$lmstep <- tmp.result$fitted.values
residuals.by.algorithm$lmstep <- tmp.result$residuals
elapsed.time.by.algorithm<- c(elapsed.time.by.algorithm,
lmstep =tmp.time.result[[3]])
# glm cv model
tmp.time.result <- system.time(
tmp.result <- FitWithglmcv(
training.set,
test.set,
k = 4
)
)
fitted.values.by.algorithm$glmcv <- tmp.result$fitted.values
residuals.by.algorithm$glmcv <- tmp.result$residuals
elapsed.time.by.algorithm <- c(elapsed.time.by.algorithm,
glmcv = tmp.time.result[[3]])
results <- list(fitted.values = fitted.values.by.algorithm,
residuals = residuals.by.algorithm,
elapsed.time = elapsed.time.by.algorithm)
return(results)
}
CalculateRH <- function(wdata) {
# Calculates the relative humidity RH using various fitting algorithms
#
# Args
# wdata: a data frame of observation and model data
#
# Returns:
# rmse.pred.by.model: rmse of lmstep and glmcv predictions
# rmse.RH.by.model: rmse RH values from the ECMWF, lm, lmstep and glmcv
# Create a list to store the rmse values for predictions and RH using the models values,
# predicted values using lm, lmstep, glmcv
rmse.pred.by.model <- NULL
rmse.RH.by.model <- NULL
# Separating T2 and D2 and generating T2_Obs and model data as well as D2_Obs and model data
T2_Obs_model_data <- wdata[, c(6:(ncol(wdata)-1))]
D2_Obs_model_data <- cbind(wdata[ncol(wdata)], wdata[,7:(ncol(wdata)-1)])
# Splitting into train and test data
n <- nrow(T2_Obs_model_data)
train = sample(1:n, size = round(0.75*n), replace=FALSE)
T2_Obs_model_data_train <- T2_Obs_model_data[train,]
T2_Obs_model_data_test <- T2_Obs_model_data[-train,]
D2_Obs_model_data_train <- D2_Obs_model_data[train,]
D2_Obs_model_data_test <- D2_Obs_model_data[-train,]
# Models and predictions for T2 and D2. Parameter value selects the formula for mymodel
results.T2 <- ModelsOfAlgorithms(T2_Obs_model_data_train, T2_Obs_model_data_test, parameter = 1)
results.D2 <- ModelsOfAlgorithms(D2_Obs_model_data_train, D2_Obs_model_data_test, parameter = 2)
# T2 and D2 observation and model data
T2_obs_test <- T2_Obs_model_data_test[,1]
D2_obs_test <- D2_Obs_model_data_test[,1]
T2_model_test <- T2_Obs_model_data_test[,3]
D2_model_test <- D2_Obs_model_data_test[,4]
# T2 and D2 from predictions using mymodel, lm, lmstep and glmcv
T2_pred_mymodel <- results.T2$fitted.values$mymodel
T2_pred_lm <- results.T2$fitted.values$lm
T2_pred_lmstep <- results.T2$fitted.values$lmstep
T2_pred_glmcv<- results.T2$fitted.values$glmcv
D2_pred_mymodel <- results.D2$fitted.values$mymodel
D2_pred_lm <- results.D2$fitted.values$lm
D2_pred_lmstep <- results.D2$fitted.values$lmstep
D2_pred_glmcv<- results.D2$fitted.values$glmcv
# rmse for T2 and D2 obtained using the rawmodel, mymodel, lm, lmstep and glmcv models
rmse.T2.raw <- sqrt(mean((T2_obs_test - T2_model_test)^2))
rmse.T2.mymodel <- sqrt(mean(results.T2$residuals$mymodel^2))
rmse.T2.lm <- sqrt(mean(results.T2$residuals$lm^2))
rmse.T2.lmstep <- sqrt(mean(results.T2$residuals$lmstep^2))
rmse.T2.glmcv <- sqrt(mean(results.T2$residuals$glmcv^2))
rmse.D2.raw <- sqrt(mean((D2_obs_test - D2_model_test)^2))
rmse.D2.mymodel <- sqrt(mean(results.D2$residuals$mymodel^2))
rmse.D2.lm <- sqrt(mean(results.D2$residuals$lm^2))
rmse.D2.lmstep <- sqrt(mean(results.D2$residuals$lmstep^2))
rmse.D2.glmcv <- sqrt(mean(results.D2$residuals$glmcv^2))
rmse.pred.by.model<- c(rmse.pred.by.model,
T2raw = rmse.T2.raw,
T2mymodel = rmse.T2.mymodel,
T2lm = rmse.T2.lm,
T2lmstep = rmse.T2.lmstep,
T2glmcv = rmse.T2.glmcv,
D2raw = rmse.D2.raw,
D2mymodel = rmse.D2.mymodel,
D2lm = rmse.D2.lm,
D2lmstep = rmse.D2.lmstep,
D2glmcv = rmse.D2.glmcv
)
# Relative humidity (RH) calculation
saturation_humidity_obs <- 6.112*exp((17.62*T2_obs_test)/(T2_obs_test + 243.12))
specific_humidity_obs <- 6.112*exp((17.62*D2_obs_test)/(D2_obs_test + 243.12))
RH_obs <- 100*(specific_humidity_obs/saturation_humidity_obs)
saturation_humidity_model <- 6.112*exp((17.62*T2_model_test)/(T2_model_test + 243.12))
specific_humidity_model <- 6.112*exp((17.62*D2_model_test)/(D2_model_test + 243.12))
RH_model <- 100*(specific_humidity_model/saturation_humidity_model)
# predicted T2, D2 using mymodel, lm, lmstep and glmcv
saturation_humidity_pred_mymodel <- 6.112*exp((17.62*T2_pred_mymodel)/(T2_pred_mymodel + 243.12))
specific_humidity_pred_mymodel <- 6.112*exp((17.62*D2_pred_mymodel)/(D2_pred_mymodel + 243.12))
RH_pred_mymodel <- 100*(specific_humidity_pred_mymodel/saturation_humidity_pred_mymodel)
saturation_humidity_pred_lm <- 6.112*exp((17.62*T2_pred_lm)/(T2_pred_lm + 243.12))
specific_humidity_pred_lm <- 6.112*exp((17.62*D2_pred_lm)/(D2_pred_lm + 243.12))
RH_pred_lm <- 100*(specific_humidity_pred_lm/saturation_humidity_pred_lm)
saturation_humidity_pred_lmstep <- 6.112*exp((17.62*T2_pred_lmstep)/(T2_pred_lmstep + 243.12))
specific_humidity_pred_lmstep <- 6.112*exp((17.62*D2_pred_lmstep)/(D2_pred_lmstep + 243.12))
RH_pred_lmstep <- 100*(specific_humidity_pred_lmstep/saturation_humidity_pred_lmstep)
saturation_humidity_pred_glmcv <- 6.112*exp((17.62*T2_pred_glmcv)/(T2_pred_glmcv + 243.12))
specific_humidity_pred_glmcv <- 6.112*exp((17.62*D2_pred_glmcv)/(D2_pred_glmcv + 243.12))
RH_pred_glmcv <- 100*(specific_humidity_pred_glmcv/saturation_humidity_pred_glmcv)
# rmse for RH
rmse_RH_model <- sqrt(mean((RH_model - RH_obs)^2))
rmse_RH_mymodel <- sqrt(mean((RH_pred_mymodel - RH_obs)^2))
rmse_RH_lm <- sqrt(mean((RH_pred_lm - RH_obs)^2))
rmse_RH_lmstep <- sqrt(mean((RH_pred_lmstep - RH_obs)^2))
rmse_RH_glmcv <- sqrt(mean((RH_pred_glmcv - RH_obs)^2))
rmse.RH.by.model<- c(rmse.RH.by.model,
rawmodelRH = rmse_RH_model,
mymodelRH = rmse_RH_mymodel,
lmRH = rmse_RH_lm,
lmstepRH = rmse_RH_lmstep,
glmcvRH = rmse_RH_glmcv
)
#cat("RH model rmse=", rmse_RH_model, "RH lm rmse=", rmse_RH_lm,
# "RH lmstep rmse=", rmse_RH_lmstep, "RH glmcv rmse=", rmse_RH_glmcv)
results <- list(rmse.pred.by.model = rmse.pred.by.model,
rmse.RH.by.model = rmse.RH.by.model)
return(results)
}
### The Main Program
# Find the lm, lmstep and glmcv, linear regression models for Kumpula station data.
# We predict T2 and D2, the air temperature and dewpoint temperature at 2 meters,
# using the above models. We calculate the relative humidity (RH) with the predicted T2 and D2.
# We compare T2, D2 and RH of the ECMWF model with the predicted T2, D2 and RH.
# Initializing variables and data structures
station_id <- 2998
analysis_time <- 1
rmse.T2D2 <- NULL
rmse.RH <- NULL
stn.season.atime.fperiod <- NULL
rmse.pred.station.season.atime.fperiod <- NULL
rmse.rh.station.season.atime.fperiod <- NULL
# loading the Kumpula data from the file data/station2998_T2_D2_Obs_model_data_with_timestamps.csv.
# This data is created using our program "create_weather_data.R"
T2_D2_Obs_model_data_with_timestamps = read.table(file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv", sep = ",", header = TRUE)
# We separately model the data for the four seasons and the 64 forecast periods (2:65)
for (s in 1:4) {
for(f in 2:65) {
df <- T2_D2_Obs_model_data_with_timestamps[which(T2_D2_Obs_model_data_with_timestamps$season == s &
T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 &
T2_D2_Obs_model_data_with_timestamps$forecast_period == f ),]
# The function CalculateRH calculates the relative humidity from the T2 and D2 predicted by the linear models.
results <- CalculateRH(df)
rmse.T2D2 <- rbind(rmse.T2D2, results$rmse.pred.by.model)
rmse.RH <- rbind(rmse.RH, results$rmse.RH.by.model)
stn.season.atime.fperiod <- c(station = station_id,
season = s,
atime = analysis_time,
fperiod = f
)
rmse.pred.station.season.atime.fperiod <- rbind(rmse.pred.station.season.atime.fperiod, c(stn.season.atime.fperiod, results$rmse.pred.by.model))
rmse.rh.station.season.atime.fperiod <- rbind(rmse.rh.station.season.atime.fperiod, c(stn.season.atime.fperiod, results$rmse.RH.by.model))
}
}
df.rmse.pred <- as.data.frame(rmse.pred.station.season.atime.fperiod)
for (s in 1:4) {
dfname <- paste("df_",s,sep ="")
assign(dfname, df.rmse.pred[which(df.rmse.pred$season == s),])
}
# Winter
summary(df_1[,5:14])
## T2raw T2mymodel T2lm T2lmstep
## Min. :1.314 Min. :1.167 Min. :1.102 Min. :1.126
## 1st Qu.:1.887 1st Qu.:1.597 1st Qu.:1.612 1st Qu.:1.610
## Median :2.289 Median :2.028 Median :2.065 Median :2.004
## Mean :2.654 Mean :2.360 Mean :2.377 Mean :2.357
## 3rd Qu.:3.415 3rd Qu.:3.080 3rd Qu.:3.121 3rd Qu.:3.090
## Max. :4.953 Max. :4.366 Max. :4.377 Max. :4.361
## T2glmcv D2raw D2mymodel D2lm
## Min. :1.102 Min. :1.294 Min. :1.090 Min. :1.109
## 1st Qu.:1.612 1st Qu.:1.957 1st Qu.:1.824 1st Qu.:1.807
## Median :2.065 Median :2.449 Median :2.357 Median :2.410
## Mean :2.377 Mean :2.914 Mean :2.722 Mean :2.772
## 3rd Qu.:3.121 3rd Qu.:3.924 3rd Qu.:3.744 3rd Qu.:3.796
## Max. :4.377 Max. :5.660 Max. :4.977 Max. :5.118
## D2lmstep D2glmcv
## Min. :1.095 Min. :1.109
## 1st Qu.:1.846 1st Qu.:1.807
## Median :2.409 Median :2.410
## Mean :2.750 Mean :2.772
## 3rd Qu.:3.795 3rd Qu.:3.796
## Max. :5.079 Max. :5.118
# Spring
summary(df_2[,5:14])
## T2raw T2mymodel T2lm T2lmstep
## Min. :1.384 Min. :1.040 Min. :1.014 Min. :0.9916
## 1st Qu.:2.116 1st Qu.:1.466 1st Qu.:1.482 1st Qu.:1.4602
## Median :2.446 Median :1.932 Median :1.946 Median :1.9413
## Mean :2.643 Mean :2.083 Mean :2.079 Mean :2.0663
## 3rd Qu.:3.067 3rd Qu.:2.516 3rd Qu.:2.514 3rd Qu.:2.5251
## Max. :4.215 Max. :3.937 Max. :3.725 Max. :3.6880
## T2glmcv D2raw D2mymodel D2lm
## Min. :1.014 Min. :1.408 Min. :1.164 Min. :1.192
## 1st Qu.:1.482 1st Qu.:2.150 1st Qu.:1.870 1st Qu.:1.917
## Median :1.946 Median :2.639 Median :2.502 Median :2.550
## Mean :2.079 Mean :2.867 Mean :2.667 Mean :2.700
## 3rd Qu.:2.514 3rd Qu.:3.475 3rd Qu.:3.344 3rd Qu.:3.374
## Max. :3.725 Max. :4.987 Max. :4.621 Max. :4.494
## D2lmstep D2glmcv
## Min. :1.196 Min. :1.192
## 1st Qu.:1.944 1st Qu.:1.917
## Median :2.529 Median :2.550
## Mean :2.684 Mean :2.700
## 3rd Qu.:3.355 3rd Qu.:3.374
## Max. :4.463 Max. :4.494
# Summer
summary(df_3[,5:14])
## T2raw T2mymodel T2lm T2lmstep
## Min. :1.287 Min. :1.065 Min. :1.002 Min. :0.9936
## 1st Qu.:1.805 1st Qu.:1.395 1st Qu.:1.370 1st Qu.:1.3366
## Median :2.237 Median :1.933 Median :1.838 Median :1.8141
## Mean :2.327 Mean :1.996 Mean :1.935 Mean :1.9286
## 3rd Qu.:2.794 3rd Qu.:2.515 3rd Qu.:2.475 3rd Qu.:2.4486
## Max. :3.912 Max. :3.491 Max. :3.325 Max. :3.3101
## T2glmcv D2raw D2mymodel D2lm
## Min. :1.002 Min. :1.048 Min. :0.8599 Min. :0.8654
## 1st Qu.:1.370 1st Qu.:1.738 1st Qu.:1.5890 1st Qu.:1.6299
## Median :1.838 Median :2.260 Median :2.1856 Median :2.1852
## Mean :1.935 Mean :2.508 Mean :2.2926 Mean :2.3053
## 3rd Qu.:2.475 3rd Qu.:3.150 3rd Qu.:2.9152 3rd Qu.:2.9241
## Max. :3.325 Max. :5.043 Max. :4.1308 Max. :4.0558
## D2lmstep D2glmcv
## Min. :0.8583 Min. :0.8654
## 1st Qu.:1.6431 1st Qu.:1.6299
## Median :2.1973 Median :2.1852
## Mean :2.2922 Mean :2.3053
## 3rd Qu.:2.8709 3rd Qu.:2.9241
## Max. :4.0577 Max. :4.0558
# Autumn
summary(df_4[,5:14])
## T2raw T2mymodel T2lm T2lmstep
## Min. :1.169 Min. :1.066 Min. :1.018 Min. :0.9544
## 1st Qu.:1.522 1st Qu.:1.385 1st Qu.:1.389 1st Qu.:1.3824
## Median :1.990 Median :1.924 Median :1.946 Median :1.9283
## Mean :2.180 Mean :2.044 Mean :2.053 Mean :2.0400
## 3rd Qu.:2.648 3rd Qu.:2.581 3rd Qu.:2.488 3rd Qu.:2.5093
## Max. :4.642 Max. :4.021 Max. :4.015 Max. :3.9945
## T2glmcv D2raw D2mymodel D2lm
## Min. :1.018 Min. :0.9917 Min. :0.8148 Min. :0.8798
## 1st Qu.:1.389 1st Qu.:1.5181 1st Qu.:1.4796 1st Qu.:1.5017
## Median :1.946 Median :2.3790 Median :2.1101 Median :2.1440
## Mean :2.053 Mean :2.5061 Mean :2.2894 Mean :2.3138
## 3rd Qu.:2.488 3rd Qu.:3.2584 3rd Qu.:2.8153 3rd Qu.:2.8787
## Max. :4.015 Max. :5.4251 Max. :4.4240 Max. :4.4736
## D2lmstep D2glmcv
## Min. :0.8224 Min. :0.8798
## 1st Qu.:1.4863 1st Qu.:1.5017
## Median :2.1379 Median :2.1440
## Mean :2.2904 Mean :2.3138
## 3rd Qu.:2.8396 3rd Qu.:2.8787
## Max. :4.5307 Max. :4.4736
df.rmse.rh <- as.data.frame(rmse.rh.station.season.atime.fperiod)
for (s in 1:4) {
  dfname <- paste0("df_", s)
  assign(dfname, df.rmse.rh[which(df.rmse.rh$season == s), ])
}
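The assign() loop above creates the four data frames df_1, ..., df_4 in the global environment, one per season. An equivalent and arguably more idiomatic pattern keeps the per-season pieces together in a named list via split(); this is only a sketch of an alternative, not the code used in the analysis:

```r
# Split the RMSE data frame into one piece per season
# (1 = winter, 2 = spring, 3 = summer, 4 = autumn).
# season_dfs is a named list with elements "1" .. "4".
season_dfs <- split(df.rmse.rh, df.rmse.rh$season)

# Equivalent to summary(df_1[, 5:9]) for the winter season:
summary(season_dfs[["1"]][, 5:9])
```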
# Winter
summary(df_1[,5:9])
## rawmodelRH mymodelRH lmRH lmstepRH
## Min. : 3.110 Min. :1.373 Min. :1.356 Min. :1.373
## 1st Qu.: 5.404 1st Qu.:2.318 1st Qu.:2.326 1st Qu.:2.322
## Median : 6.683 Median :2.634 Median :2.578 Median :2.609
## Mean : 6.737 Mean :2.790 Mean :2.723 Mean :2.724
## 3rd Qu.: 7.722 3rd Qu.:3.227 3rd Qu.:3.137 3rd Qu.:3.111
## Max. :10.804 Max. :4.802 Max. :4.571 Max. :4.689
## glmcvRH
## Min. :1.339
## 1st Qu.:2.207
## Median :2.475
## Mean :2.656
## 3rd Qu.:3.065
## Max. :4.428
# Spring
summary(df_2[,5:9])
## rawmodelRH mymodelRH lmRH lmstepRH
## Min. : 5.093 Min. :1.747 Min. :1.687 Min. :1.673
## 1st Qu.: 7.816 1st Qu.:2.916 1st Qu.:2.863 1st Qu.:2.849
## Median :10.135 Median :3.894 Median :3.784 Median :3.812
## Mean :10.141 Mean :3.969 Mean :3.919 Mean :3.913
## 3rd Qu.:11.898 3rd Qu.:4.792 3rd Qu.:4.732 3rd Qu.:4.733
## Max. :17.399 Max. :7.019 Max. :6.879 Max. :6.845
## glmcvRH
## Min. :1.655
## 1st Qu.:2.826
## Median :3.768
## Mean :3.905
## 3rd Qu.:4.672
## Max. :6.921
# Summer
summary(df_3[,5:9])
## rawmodelRH mymodelRH lmRH lmstepRH
## Min. : 4.069 Min. :1.537 Min. :1.452 Min. :1.449
## 1st Qu.: 7.496 1st Qu.:2.710 1st Qu.:2.738 1st Qu.:2.739
## Median : 9.545 Median :3.383 Median :3.370 Median :3.334
## Mean : 9.510 Mean :3.649 Mean :3.624 Mean :3.605
## 3rd Qu.:11.905 3rd Qu.:4.693 3rd Qu.:4.607 3rd Qu.:4.624
## Max. :15.656 Max. :5.683 Max. :5.852 Max. :5.766
## glmcvRH
## Min. :1.432
## 1st Qu.:2.706
## Median :3.311
## Mean :3.623
## 3rd Qu.:4.575
## Max. :5.837
# Autumn
summary(df_4[,5:9])
## rawmodelRH mymodelRH lmRH lmstepRH
## Min. : 3.115 Min. :1.377 Min. :1.253 Min. :1.249
## 1st Qu.: 4.545 1st Qu.:1.853 1st Qu.:1.814 1st Qu.:1.816
## Median : 5.852 Median :2.250 Median :2.206 Median :2.234
## Mean : 5.831 Mean :2.365 Mean :2.334 Mean :2.311
## 3rd Qu.: 6.798 3rd Qu.:2.727 3rd Qu.:2.635 3rd Qu.:2.608
## Max. :10.256 Max. :4.360 Max. :4.244 Max. :4.098
## glmcvRH
## Min. :1.235
## 1st Qu.:1.821
## Median :2.196
## Mean :2.289
## 3rd Qu.:2.672
## Max. :4.149
We plot the RMSE in the predictions of T2 and D2, using the T2 and D2 obtained from the ECMWF model and from the linear models mymodel, lm, lmstep and glmcv, for the winter data. The plots show that the RMSE in T2 and D2 is smaller for our linear regression models than for the raw T2 and D2 from the ECMWF model. The predictions of mymodel are comparable to those of the other linear regression models.
require(ggplot2)
require(reshape)
df.rmse.pred <- as.data.frame(rmse.pred.station.season.atime.fperiod)
df_winter_pred_model <- df.rmse.pred[which(df.rmse.pred$season == 1),]
df_winter_T2_pred_model <- df_winter_pred_model[, c(4:9)]
df_winter_D2_pred_model <- df_winter_pred_model[, c(4,10:14)]
fperiod <- df_winter_T2_pred_model$fperiod
molten <- melt(df_winter_T2_pred_model, id.vars = 'fperiod', variable_name = "model")
ggplot(molten, aes(x = fperiod, y = value, colour = model)) + geom_line() +
  xlab("forecast period") + ylab("RMSE") +
  ggtitle("RMSE in T2 prediction with linear regression models for the winter season")
molten <- melt(df_winter_D2_pred_model, id.vars = 'fperiod', variable_name = "model")
ggplot(molten, aes(x = fperiod, y = value, colour = model)) + geom_line() +
  xlab("forecast period") + ylab("RMSE") +
  ggtitle("RMSE in D2 prediction with linear regression models for the winter season")
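The RMSE values plotted here are precomputed in rmse.pred.station.season.atime.fperiod. The underlying calculation can be sketched as follows; rmse is a hypothetical helper, not part of the assignment code:

```r
# Root mean square error between predictions and observations.
# Hypothetical helper; the assignment uses precomputed RMSE values.
rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2, na.rm = TRUE))
}

# Example: RMSE of three forecasts against their observations
rmse(c(1.2, 2.8, 3.1), c(1.0, 3.0, 3.0))  # approximately 0.173
```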
We plot the RMSE in RH, which depends on T2 and D2. We plot the RH based on the T2 and D2 obtained from the ECMWF model and from the linear models mymodel, lm, lmstep and glmcv. We see that the errors in T2 and D2 are amplified in RH. The plots show that the RMSE in RH is reduced by up to a factor of two in our linear regression models.
df.rmse.rh <- as.data.frame(rmse.rh.station.season.atime.fperiod)
df_winter_rh_model <- df.rmse.rh[which(df.rmse.rh$season == 1),]
df_winter_rh_model <- df_winter_rh_model[, c(4:9)]
molten <- melt(df_winter_rh_model, id.vars = 'fperiod', variable_name = "model")
ggplot(molten, aes(x = fperiod, y = value, colour = model)) + geom_line() +
  xlab("forecast period") + ylab("RMSE") +
  ggtitle("RMSE in RH with linear regression models for the winter season")
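The amplification of T2 and D2 errors in RH can be illustrated directly. The exact RH formula used in the analysis is not shown here; a common choice is the Magnus approximation, sketched below with the Alduchov-Eskridge coefficients (an assumption for illustration, not necessarily the formula used in the project):

```r
# Relative humidity (%) from air temperature t2 and dew point d2,
# both in degrees Celsius, via the Magnus approximation.
# Illustrative sketch; the coefficients used in the actual
# analysis may differ.
rh_from_t2_d2 <- function(t2, d2) {
  # es(t) is proportional to the saturation vapour pressure at t
  es <- function(t) exp(17.625 * t / (243.04 + t))
  100 * es(d2) / es(t2)
}

# A 1 degree error in the predicted D2 shifts RH by several
# percentage points, so forecast errors in T2/D2 are amplified in RH:
rh_from_t2_d2(2, 0)   # RH with the "true" D2
rh_from_t2_d2(2, 1)   # RH with a +1 degree D2 error
```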
The three plots below give the quantile values of the prediction errors in T2, D2 and RH for all seasons. We can see that mymodel has the lowest quantile values for T2 and similar values for D2. The median RMSE in RH for the ECMWF model is almost double that of the linear regression models.
df.rmse.pred.station.fperiod <- df.rmse.pred[,c(2, 4:9)]
molten <- melt(df.rmse.pred.station.fperiod, id.vars = c("season", "fperiod"), variable_name = "model")
ggplot(molten, aes(x = season, y = value, colour = model)) + geom_boxplot() +
  xlab("season (1 = winter, 2 = spring, 3 = summer, 4 = autumn)") + ylab("RMSE") +
  ggtitle("RMSE in T2 with linear regression models for four seasons")
df.rmse.pred.station.fperiod <- df.rmse.pred[,c(2, 4,10:14)]
molten <- melt(df.rmse.pred.station.fperiod, id.vars = c("season", "fperiod"), variable_name = "model")
ggplot(molten, aes(x = season, y = value, colour = model)) + geom_boxplot() +
  xlab("season (1 = winter, 2 = spring, 3 = summer, 4 = autumn)") + ylab("RMSE") +
  ggtitle("RMSE in D2 with linear regression models for four seasons")
df.rmse.rh.station.fperiod <- df.rmse.rh[,c(2, 4:9)]
molten <- melt(df.rmse.rh.station.fperiod, id.vars = c("season", "fperiod"), variable_name = "model")
ggplot(molten, aes(x = season, y = value, colour = model)) + geom_boxplot() +
  xlab("season (1 = winter, 2 = spring, 3 = summer, 4 = autumn)") + ylab("RMSE") +
  ggtitle("RMSE in RH with linear regression models for four seasons")
Statistical postprocessing is a method to improve the direct model output from numerical weather prediction (NWP) models. In this assignment we used linear regression models that relate the observed air temperature at 2 meters (T2Obs) and the observed dew point temperature at 2 meters (D2Obs) to the NWP output in order to improve its forecast capability. We developed a linear regression model, called mymodel, for the air temperature at 2 meters (T2) and the dew point temperature at 2 meters (D2), and compared it with the standard modelling functions available in R such as lm, step and glm. We then used the predicted T2 and D2 to calculate the relative humidity (RH) and the root mean square error (RMSE) in RH for the ECMWF NWP model and for the linear models. The results show that the linear regression models reduce the RMSE in RH by up to a factor of two. The data used for the analysis are the ECMWF forecasts for the Kumpula weather station for the four seasons and 64 forecast periods, together with the T2Obs and D2Obs observations available at the Finnish Meteorological Institute.
Our study shows that statistical postprocessing improves the prediction capability of a model. We observed that the prediction errors in a variable are amplified or influenced by the prediction errors in the variables it depends on. In this study we used only linear regression models. In the future we would like to explore the problem of interdependence with principal component analysis (PCA) and principal component regression (PCR) to predict the variables and their dependent variables.
I am grateful to the POSSE project manager Mikko Rauhala (mikko.rauhala@fmi.fi) for allowing me to present the work I have been doing in the project as my final assignment for the IODS course. I am grateful to Jussi Ylhäisi for proposing the problem of the interdependence of variables, for providing me with the data from the Kumpula weather station, and for the everyday discussions on weather data and modelling. My deep gratitude goes to the IODS course instructors and to my fellow students whose assignments I have peer reviewed and who have reviewed my course work. They have taught me how to understand a data science problem, comprehend the variables involved and analyze the problem deeply.